AI-Powered Document Classification with paperless-ai and Ollama

This post is a complete runbook for integrating AI-powered auto-tagging and classification into paperless-ngx using paperless-ai and a locally running Ollama instance. The setup uses a local LLM to read document text and automatically populate metadata fields: title, document type, tags, correspondent, date, and custom fields.

Hardware and Architecture

NAS (Synology DS1621+, 10.0.10.10): runs paperless-ngx on port 5656
Desktop PC: Windows with WSL2, Docker Desktop, RTX 4090
Goal: AI auto-tagging/classification using a local LLM, zero cloud dependency

The key architecture decision is a pull model: paperless-ai runs in WSL2 Docker, polls the paperless-ngx API for documents tagged ai-pending, processes them with Ollama, and writes metadata back. This fits a desktop that is not on 24/7: the NAS holds the queue and the desktop drains it whenever it is available.

paperless-ngx (NAS)
       ↑  ↓  (REST API)
 paperless-ai (WSL2 Docker)
       ↑  ↓  (HTTP)
    Ollama (Windows native)
       ↑
    RTX 4090 (GPU)

Ollama runs natively on Windows (not in WSL) for best GPU access. From inside a Docker container in WSL2, it is reachable via the special hostname host.docker.internal.

Prerequisites

paperless-ngx running and accessible via API
Docker Desktop installed on Windows with WSL2 integration enabled
Ollama installed on Windows

Step 1 — Set Up Ollama to Listen on All Interfaces

By default, Ollama only listens on 127.0.0.1, making it unreachable from WSL2 Docker containers. You must set a Windows system environment variable.

1. Open System Properties → Advanced → Environment Variables
2. Under System variables, click New
3. Variable name: OLLAMA_HOST
4. Variable value: 0.0.0.0
5. Click OK, then restart Ollama (quit it from the tray icon and relaunch)

Verify from WSL2:

curl http://$(ip route | awk '/default/ {print $3}'):11434/api/tags

From inside a Docker container, Ollama is reachable at host.docker.internal:11434.
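
The same reachability check can be scripted. This is a minimal Python sketch, assuming the /api/tags reply has the shape {"models": [{"name": ...}]}; the function names are illustrative and are not part of Ollama or paperless-ai.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags reply."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_model_names(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query Ollama's /api/tags endpoint; raises URLError if it is unreachable."""
    url = base_url.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling fetch_model_names("http://host.docker.internal:11434") from inside the container should return the installed models; a timeout there usually points back at the OLLAMA_HOST setting.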

Step 2 — Pull the Right Model

The model must support Ollama structured output (the format / JSON schema parameter). This uses constrained token-level decoding to enforce JSON output — not all models support it.

Critical: qwen3-vl:8b (the vision-language variant) does not support structured output. When you pass a format schema, Ollama returns an empty response string without raising an error, which makes the failure hard to diagnose.

Use qwen3:8b (the base model) instead:

# Run in PowerShell on Windows
ollama pull qwen3:8b

Test structured output works:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "format": {"type": "object", "properties": {"title": {"type": "string"}}, "required": ["title"]},
  "prompt": "Return a JSON object with a title field set to hello world.",
  "stream": false
}'

The response field should be a non-empty JSON string. If it is "", the model does not support structured output.
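
The pass/fail rule above can be written down as a small check. This is a sketch under the assumption that the reply is the non-streaming /api/generate JSON with a response string field; both helper names are made up for illustration.

```python
import json


def build_generate_request(model: str, prompt: str, schema: dict) -> dict:
    """Build the JSON body for POST /api/generate with a format schema."""
    return {"model": model, "prompt": prompt, "format": schema, "stream": False}


def structured_output_ok(api_reply: dict, required: list[str]) -> bool:
    """Return True if 'response' holds a JSON object with all required keys."""
    text = api_reply.get("response", "")
    if not text.strip():
        return False  # the silent-failure case: empty response string
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required)
```

Feeding it the parsed output of the curl test with required=["title"] gives a yes/no answer that is easy to use when comparing several models.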

Step 3 — Create Tags in paperless-ngx

Create two tags in paperless-ngx (Settings → Tags):

| Tag | Purpose |
| --- | --- |
| `ai-pending` | Input filter: documents with this tag will be processed by paperless-ai |
| `ai-processed` | Output marker: paperless-ai adds this after successful processing |

Set the matching algorithm for both tags to None (they are assigned by workflows and paperless-ai, not by auto-matching rules).

Note the tag IDs from the API (you will not need them explicitly, but they are useful for verification):

curl -s http://10.0.10.10:5656/api/tags/ \
  -H "Authorization: Token <YOUR_TOKEN>" | python3 -m json.tool | grep -A3 "ai-pending"
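
If grep output is awkward to read, the same lookup can be done on the parsed JSON. This sketch assumes the list reply wraps tags in a results array with id and name keys; tag_ids_by_name is a hypothetical helper.

```python
def tag_ids_by_name(tags_payload: dict, names: list[str]) -> dict[str, int]:
    """Map each requested tag name to its numeric ID, skipping names not found."""
    wanted = set(names)
    return {
        tag["name"]: tag["id"]
        for tag in tags_payload.get("results", [])
        if tag["name"] in wanted
    }
```

Note that the list endpoint is paginated, so a large tag collection may need more than one request before both tags show up.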

Step 4 — Create Workflows in paperless-ngx

paperless-ai never removes tags — it only adds them. The ai-pending tag must be removed after processing via a workflow. Set up two workflows in paperless-ngx (Settings → Workflows):

Workflow 1: “AI Processing Queue”

Trigger: Document Added
Action: Assign tag ai-pending

This ensures every newly added document enters the AI processing queue automatically.

Workflow 2: “Remove ai-pending after AI processed”

Trigger: Document Updated — has tag ai-processed
Action: Remove tag ai-pending

This cleans up the queue marker after paperless-ai finishes. Without this workflow, the ai-pending tag stays on every document, and the same documents would be sent to Ollama again on every scan.

Step 5 — Create the paperless-ai Project Files

Create a directory for the project:

mkdir -p ~/repo/paperless-ai
cd ~/repo/paperless-ai

docker-compose.yml

services:
  paperless-ai:
    image: clusterzx/paperless-ai
    container_name: paperless-ai
    restart: unless-stopped
    user: "0:0"
    env_file:
      - .env
    ports:
      - "3000:3000"
    volumes:
      - paperless-ai_data:/app/data

volumes:
  paperless-ai_data:

The user: "0:0" directive is essential. paperless-ai writes config and a SQLite database inside /app/data. With Docker Desktop on WSL2, permission mapping issues cause the node user (default) to be unable to create files in the volume — running as root eliminates these problems entirely.

.env

PAPERLESS_API_URL=http://10.0.10.10:5656/api
PAPERLESS_API_TOKEN=<YOUR_TOKEN>
PAPERLESS_USERNAME=yang
AI_PROVIDER=ollama
OLLAMA_API_URL=http://host.docker.internal:11434
OLLAMA_MODEL=qwen3:8b
SCAN_INTERVAL=*/30 * * * *
PROCESS_PREDEFINED_DOCUMENTS=yes
TAGS=ai-pending
ADD_AI_PROCESSED_TAG=yes
AI_PROCESSED_TAG_NAME=ai-processed
USE_EXISTING_DATA=yes

Key settings explained:

TAGS=ai-pending - the tag that marks documents for processing
SCAN_INTERVAL=*/30 * * * * - cron expression: poll paperless-ngx every 30 minutes
PROCESS_PREDEFINED_DOCUMENTS=yes - restrict processing to documents carrying a tag listed in TAGS; without it, the TAGS filter is not applied
ADD_AI_PROCESSED_TAG=yes - add the ai-processed tag after processing (required for the cleanup workflow)
USE_EXISTING_DATA=yes - pass the tags and correspondents that already exist in paperless-ngx to the model so it can reuse them
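
Because the queue depends on several of these settings agreeing with each other, a quick sanity check can catch typos before the container starts. The sketch below is a deliberately simplified KEY=VALUE reader (no quoting, no inline comments), not the parser paperless-ai uses, and check_queue_settings is a hypothetical helper.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; blank lines and full-line comments are skipped."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def check_queue_settings(env: dict[str, str]) -> list[str]:
    """Return a list of problems with the tag-queue settings (empty means OK)."""
    problems = []
    if env.get("PROCESS_PREDEFINED_DOCUMENTS") != "yes":
        problems.append("PROCESS_PREDEFINED_DOCUMENTS is not 'yes'")
    if not env.get("TAGS"):
        problems.append("TAGS is empty")
    if env.get("ADD_AI_PROCESSED_TAG") != "yes":
        problems.append("ADD_AI_PROCESSED_TAG is not 'yes'")
    if not env.get("AI_PROCESSED_TAG_NAME"):
        problems.append("AI_PROCESSED_TAG_NAME is empty")
    return problems
```

Running check_queue_settings(parse_env(open(".env").read())) in the project directory should return an empty list for the file shown above.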

Step 6 — Write the System Prompt

paperless-ai sends document text to Ollama with your custom system prompt. The prompt is set via the web UI at http://localhost:3000 and is stored as SYSTEM_PROMPT in /app/data/.env inside the container.

The prompt should define:

What document types exist (use consistent naming)
What topic tags are available
What custom fields to fill in
Explicit rules for edge cases

Key lessons from prompt engineering for this setup:

Specify all valid values explicitly — do not let the model invent document types or tags
Forbid reserved tags explicitly — if you have status tags managed by humans, list them as absolutely forbidden
Require string types for custom fields — paperless-ai expects all custom field values as strings; tell the model: “All custom field values must be strings (in quotes) or null. Write "2017.08" not 2017.08”
Give clear examples for ambiguous cases — e.g., “Medical bills → use type 发票收据 + tag #医疗, NOT type 医疗记录”

Example partial prompt structure:

You are a document classification assistant for a personal family archive.

## Document Types (choose exactly one)
- Invoice/Receipt (发票收据): bills, invoices, receipts
- Tax Document (税务文件): W-2, 1099, tax returns
- Immigration Document (移民文件): visas, passports, I-94
...

## Tags (assign all that apply from this list only)
- #insurance (#保险)
- #medical (#医疗)
- #financial (#财务)
...

## Custom Fields
- Amount: numeric amount as string, e.g. "2017.08" or null
- Bill Period: statement period end date, YYYY-MM-DD format or null
- Expiry Date: expiration date, YYYY-MM-DD format or null
- Document Year: year as string, e.g. "2024" or null
- Account / Policy Number: account or policy number as string or null

## Rules
- All custom field values must be strings (in quotes) or null
- Medical bills → type: 发票收据, tag: #医疗 — do NOT use type 医疗记录
- NEVER assign under any circumstances: #待处理 #重要 #归档 — these are
  reserved human-only status tags and must NEVER appear in your output

Step 7 — Start the Container

cd ~/repo/paperless-ai
docker compose up -d

Open the web UI at http://localhost:3000 to verify the configuration. The UI allows reviewing and editing settings, and triggering a manual scan.

Important: After you save settings in the web UI, the authoritative configuration lives in /app/data/.env inside the Docker volume. The .env file next to docker-compose.yml only supplies the initial environment variables; the UI writes its own config file, which takes precedence for some settings. If you edit the project .env and need the container to pick up the changes, run docker compose up -d. docker compose restart does not re-read env files.

Troubleshooting

Docker daemon not running

Error response from daemon: dial unix /var/run/docker.sock: no such file or directory

Start Docker Desktop on Windows. Consider enabling “Start on login” in Docker Desktop settings.

Ollama not reachable from WSL2

connect ETIMEDOUT 10.255.255.254:11434

This means OLLAMA_HOST=0.0.0.0 is not set or Ollama was not restarted after setting it. Verify Ollama is listening:

# In PowerShell
netstat -ano | findstr 11434

The local address should show 0.0.0.0:11434, not 127.0.0.1:11434.

.env changes not picked up

docker compose restart does not re-read the env_file. Always use:

docker compose up -d

This recreates the container with the new environment.

Structured output returns empty response

No response data from Ollama API

The model does not support Ollama’s format parameter. List the locally installed models (the one paperless-ai actually uses is whatever OLLAMA_MODEL names):

curl http://localhost:11434/api/tags

Switch from any *-vl variant to the base model. Replace qwen3-vl:8b with qwen3:8b.

Custom field value type error

TypeError: customField.value?.trim is not a function

The AI returned a numeric value (e.g., 2017.08) where paperless-ai expected a string ("2017.08"). Add this rule to your system prompt: “All custom field values must be strings (in quotes) or null.”
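
The difference between the two forms is only visible after JSON parsing. The Python sketch below shows how a number and a string come out as different types, and how a hypothetical stringify_custom_fields helper (not part of paperless-ai) would normalize them.

```python
import json


def stringify_custom_fields(fields: dict) -> dict:
    """Coerce every non-null custom field value to a string; keep null as None."""
    return {
        name: (None if value is None else str(value))
        for name, value in fields.items()
    }


# What the model might return: Amount as a JSON number, Document Year as a string.
raw = json.loads('{"Amount": 2017.08, "Document Year": "2024", "Expiry Date": null}')
normalized = stringify_custom_fields(raw)
```

Here raw["Amount"] is a float, which has no trim-like string method, while normalized["Amount"] is the string "2017.08". The prompt rule remains the primary fix; this only illustrates what the rule asks the model to do.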

`#` in custom field name causes env parsing error

SyntaxError: Unterminated string in JSON at position 44

A custom field named something like Account / Policy # contains #, which .env parsing treats as the start of a comment, so the stored value is cut off mid-string. Rename the field in paperless-ngx to avoid #, e.g. Account / Policy Number. You can rename it through the API (5 is the field's ID in this setup; substitute your own):

curl -X PATCH http://10.0.10.10:5656/api/custom_fields/5/ \
  -H "Authorization: Token <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"name": "Account / Policy Number"}'
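
To see why the # produces an unterminated-string error, consider a simplified model of inline-comment handling. This is not dotenv's real parser, and the JSON layout of the stored value is invented for illustration; it only shows the mechanism.

```python
import json


def strip_inline_comment(value: str) -> str:
    """Simplified model: cut an unquoted value at the first ' #' sequence."""
    head, _sep, _comment = value.partition(" #")
    return head.rstrip()


def parses_as_json(value: str) -> bool:
    """Return True if the value is valid JSON."""
    try:
        json.loads(value)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical stored values, before and after renaming the field.
with_hash = '{"custom_fields":[{"value":"Account / Policy #","data_type":"string"}]}'
renamed = '{"custom_fields":[{"value":"Account / Policy Number","data_type":"string"}]}'
```

After truncation, with_hash ends in the middle of a string literal and no longer parses, while renamed passes through unchanged.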

ai-pending tag not removed after processing

The tag stays on documents after AI processing. This means the cleanup workflow is not set up or not triggering. Verify:

Workflow “Remove ai-pending after AI processed” exists in Settings → Workflows
The trigger is: Document Updated, condition: has tag ai-processed
The action is: Remove tag ai-pending

Remember: paperless-ai source code merges tags and never removes any. Removal requires the workflow.

sed corrupting unrelated env vars

If you use sed to edit /app/data/.env inside the container, be careful with substring matches. For example:

sed -i 's/CUSTOM_FIELDS=.*/NEW_VALUE/' /app/data/.env

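Because the pattern is unanchored, it also matches inside ACTIVATE_CUSTOM_FIELDS=, where CUSTOM_FIELDS is a substring. Anchor the pattern to the start of the line; in sed that means ^CUSTOM_FIELDS=, and the same fix in Python looks like this: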

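python3 -c "
import re
from pathlib import Path

env_path = Path('/app/data/.env')
content = env_path.read_text()
# ^ with re.MULTILINE anchors the match to the start of a line,
# so ACTIVATE_CUSTOM_FIELDS= is left untouched.
content = re.sub(r'^CUSTOM_FIELDS=.*', 'CUSTOM_FIELDS=NEW_VALUE', content, flags=re.MULTILINE)
env_path.write_text(content)
"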

AI assigning forbidden status tags

The model occasionally assigns tags you have reserved for human use. Strengthen the prohibition in the prompt:

NEVER assign under any circumstances: #待处理 #重要 #归档 — these are
reserved human-only status tags and must NEVER appear in your output
under any circumstances, for any document type, regardless of content.
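
A prompt rule lowers the rate of violations but cannot guarantee zero, so it can help to audit the results. The helper below is hypothetical and is not wired into paperless-ai; it could be run over tag lists fetched from the paperless-ngx API to spot reserved tags that slipped through.

```python
RESERVED_TAGS = frozenset({"#待处理", "#重要", "#归档"})


def split_reserved(tags: list[str], reserved: frozenset = RESERVED_TAGS) -> tuple[list[str], list[str]]:
    """Split a tag list into (allowed, rejected), preserving the original order."""
    allowed = [tag for tag in tags if tag not in reserved]
    rejected = [tag for tag in tags if tag in reserved]
    return allowed, rejected
```

A non-empty rejected list on a freshly processed document is a signal to tighten the prompt further.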

How paperless-ai Works Internally

Understanding the internals helps when debugging:

1. paperless-ai polls the paperless-ngx API for documents with the tag ai-pending
2. For each document, it fetches the full text content
3. It sends the text and the system prompt to Ollama with the format: jsonSchema parameter
4. Ollama uses constrained decoding (enforced at the token-sampling level) to produce valid JSON
5. paperless-ai parses the response: title, document_type, tags, correspondent, document_date, language, custom_fields
6. It calls paperlessService.updateDocument(), which merges tags with [...new Set([...currentDoc.tags, ...updates.tags])], so existing tags are never removed
7. It adds the ai-processed tag to signal completion
8. The paperless-ngx workflow detects ai-processed and removes ai-pending
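
The tag lifecycle in the last three steps can be modeled in a few lines. This is a Python sketch of the behavior described above, not the actual JavaScript; the tag IDs are arbitrary example numbers.

```python
def merge_tags(current: list[int], updates: list[int]) -> list[int]:
    """Order-preserving union, like [...new Set([...current, ...updates])] in JS."""
    return list(dict.fromkeys([*current, *updates]))


def cleanup_workflow(tags: list[int], pending: int, processed: int) -> list[int]:
    """Model of Workflow 2: drop the pending tag once the processed tag is present."""
    if processed in tags:
        return [tag for tag in tags if tag != pending]
    return tags


PENDING, PROCESSED = 1, 2
after_ai = merge_tags([PENDING, 7], [7, 9, PROCESSED])   # paperless-ai only adds
after_workflow = cleanup_workflow(after_ai, PENDING, PROCESSED)
```

Because the merge is a union, the pending tag survives the paperless-ai update; only the workflow step removes it.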

File Permissions Deep Dive

The user: "0:0" setting in docker-compose deserves explanation. paperless-ai’s base image runs as the node user. The named Docker volume’s root directory is owned by root:root with permissions 755. The node user can read and traverse the directory but cannot create new files in it (the application writes config atomically: create temp file, then rename — both require write permission to the directory). Running as root bypasses all of this.

An alternative approach — switching to a bind mount — fails on WSL2/Docker Desktop because the uid/gid mapping between WSL2 and Windows causes SQLite to be unable to create database files.
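
The write-temp-then-rename pattern mentioned above is why directory write permission matters: both the temp file and the rename modify the directory itself. The sketch below shows the general pattern in Python; paperless-ai is a Node application, so this is an analogy rather than its code.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Write data to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp fails with PermissionError if the directory is not writable.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def can_create_files(directory: str) -> bool:
    """Check whether the current user may create entries in the directory."""
    return os.access(directory, os.W_OK | os.X_OK)
```

For a non-root user on a root-owned directory with mode 755, can_create_files would be expected to return False and atomic_write to fail at the mkstemp step.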

Daily Usage

paperless-ai processes documents on startup and then every 30 minutes per SCAN_INTERVAL
Monitor and trigger manual scans at http://localhost:3000
The web UI shows processing history and current queue status
To bulk-remove tags in paperless-ngx: list view → select one document → “Select all X documents” appears → Actions → Edit Tags

Summary

| Component | Location | Notes |
| --- | --- | --- |
| paperless-ngx | NAS `10.0.10.10:5656` | Document storage and API |
| paperless-ai | WSL2 Docker, port 3000 | Orchestrates AI processing |
| Ollama | Windows native, port 11434 | LLM inference with GPU |
| Model | `qwen3:8b` | Base model, not VL variant |
| Trigger tag | `ai-pending` | Added by paperless-ngx workflow |
| Completion tag | `ai-processed` | Added by paperless-ai |

The entire pipeline is self-hosted, GPU-accelerated, and requires no cloud services. Documents are processed locally with full privacy.

Post-Setup Fixes

“Restrict to existing document types” setting does nothing (bug)

paperless-ai has a UI toggle to restrict document type assignment to existing types only. As of 2026-03, this is a confirmed bug (#834, #799): getOrCreateDocumentType() in services/paperlessService.js has no restriction guard, while getOrCreateCorrespondent() correctly implements it. A fix was submitted in PR #865 but closed as stale without merging.

Workaround: copy paperlessService.js out of the container, patch it, and bind-mount it back.

docker cp paperless-ai:/app/services/paperlessService.js ./paperlessService.js

In paperlessService.js, change the function signature and add the guard (mirror of how getOrCreateCorrespondent works):

// Before:
async getOrCreateDocumentType(name) {

// After:
async getOrCreateDocumentType(name, options = {}) {
  const restrictToExistingDocumentTypes = options.restrictToExistingDocumentTypes === true ||
      (options.restrictToExistingDocumentTypes === undefined &&
       process.env.RESTRICT_TO_EXISTING_DOCUMENT_TYPES === 'yes');

  // ... after the existingDocType search, before the create call:
  if (restrictToExistingDocumentTypes) {
      console.log(`[DEBUG] Document type "${name}" does not exist and restrictions are enabled, returning null`);
      return null;
  }

Then bind-mount the patched file in docker-compose.yml:

volumes:
  - paperless-ai_data:/app/data
  - ./paperlessService.js:/app/services/paperlessService.js:ro

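Recreate the container with docker compose up -d. The bind mount survives image updates, which also means the patched file keeps overriding whatever paperlessService.js ships in a newer image. After updating the image, check whether the upstream bug has been fixed: if it has, remove the bind mount; otherwise re-copy and re-patch the file.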
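
The precedence rule in the patch (an explicit option wins, otherwise the environment variable decides) is easy to get wrong, so here it is restated as a Python sketch. The function names and the in-memory existing mapping are illustrative; the real code talks to the paperless-ngx API.

```python
import os


def restrict_to_existing(option=None, env=None) -> bool:
    """True if option is True, or if option is unset and the env var is 'yes'."""
    env = os.environ if env is None else env
    if option is True:
        return True
    return option is None and env.get("RESTRICT_TO_EXISTING_DOCUMENT_TYPES") == "yes"


def get_or_create_document_type(name, existing: dict, option=None, env=None):
    """Return the type ID, or None when creation is blocked by the restriction."""
    if name in existing:
        return existing[name]
    if restrict_to_existing(option, env):
        return None  # the guard the patch adds before the create call
    new_id = max(existing.values(), default=0) + 1
    existing[name] = new_id
    return new_id
```

An explicit option=False overrides the environment variable, matching the === undefined check in the JavaScript.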

AI setting correspondent to the string “null”

The model outputs "null" as a string when no correspondent is known, and paperless-ai then creates a correspondent literally named null. Fix: update the prompt to state that the correspondent field accepts either a name string or JSON null (not the word “null”). Also make the JSON template show null unquoted:

"correspondent": "Name or null",   ← template hint
...
## Correspondent: JSON null if unclear — never the string "null"

Clean up any null correspondents via API:

# Find and delete
curl -s http://NAS:5656/api/correspondents/ -H "Authorization: Token TOKEN" | \
  python3 -c "import sys,json; [print(c['id'], c['name']) for c in json.load(sys.stdin)['results']]"

curl -X DELETE http://NAS:5656/api/correspondents/ID/ -H "Authorization: Token TOKEN"
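
Picking the IDs to delete out of the listing can also be done programmatically. This sketch assumes the reply wraps correspondents in a results array; matching case-insensitively and ignoring surrounding whitespace is an added assumption, and null_correspondent_ids is a hypothetical helper.

```python
def null_correspondent_ids(payload: dict) -> list[int]:
    """Return IDs of correspondents whose name is the literal word 'null'."""
    return [
        c["id"]
        for c in payload.get("results", [])
        if c.get("name", "").strip().lower() == "null"
    ]
```

Each returned ID can then be substituted for ID in the DELETE request above.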

Taxonomy design: document types

After running the system for a while, 发票收据 (Invoices & Receipts) became a catch-all for things that aren’t actual invoices — hotel booking confirmations, repair estimates, project quotes, and permit passes. The solution was to add two targeted types rather than keep forcing everything into a single bucket.

Added two new document types:

| Type | Use case |
| --- | --- |
| `行程预订` (Travel & Booking) | Hotel/flight confirmations, event tickets, passes, permits (SNO-PARK, etc.) |
| `报价估价` (Quotes & Estimates) | Repair estimates, construction quotes, project proposals; anything not yet paid |

Deleted all zero-doc AI-created English types (Estimate, Invoice, Quote, repair_estimate, Travel Itinerary, Technical Manual, Product Manual, manual — ids 20–27).

Reclassified ~9 documents: hotel confirmations → 行程预订, airline itineraries → 行程预订, vehicle/home repair estimates → 报价估价, work authorization → 合同协议.

The updated prompt rules that were added to reduce misclassification:

- Estimates/quotes (not yet paid) → "报价估价"; paid invoices → "发票收据"
- Hotel/flight confirmations → "行程预订" (not "发票收据")
- Construction work authorizations, renovation contracts → "合同协议" (not "发票收据")

Updating the system prompt without the web UI

The paperless-ai web UI is the normal way to update the system prompt, but it’s inconvenient for iterative editing. The prompt is stored as SYSTEM_PROMPT in the container’s /app/data/.env (inside the Docker volume) using \n-escaped single-line format.

Caveat: dotenv v16 treats # after a \n sequence as a comment in unquoted values, so a prompt containing ## Section headers gets silently truncated. The fix is to wrap the value in double quotes when writing.

Workflow: keep the prompt in PROMPT.md alongside docker-compose.yml, and run a helper script to push it into the container:

# update-prompt.sh — run after editing PROMPT.md
docker cp PROMPT.md paperless-ai:/tmp/PROMPT.md
docker exec paperless-ai node -e "
  const fs = require('fs'), dotenv = require('dotenv');
  const prompt = fs.readFileSync('/tmp/PROMPT.md', 'utf8').trimEnd();
  const escaped = prompt.replace(/\\\\/g,'\\\\\\\\').replace(/\"/g,'\\\\\"').replace(/\n/g,'\\\\n');
  let env = fs.readFileSync('/app/data/.env','utf8');
  fs.writeFileSync('/app/data/.env', env.replace(/^SYSTEM_PROMPT=.*$/m,'SYSTEM_PROMPT=\"'+escaped+'\"'));
  const val = dotenv.parse(fs.readFileSync('/app/data/.env')).SYSTEM_PROMPT;
  console.log('Updated:', val?.length, 'chars');
"
docker restart paperless-ai

The script lives at ~/repo/paperless-ai/update-prompt.sh and is executable.
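
The escaping step in the script is dense because it is nested inside shell and JavaScript quoting. The same transformation is shown here in Python, together with a simplified reader that reverses it. The reader is an assumption for illustration only; the dotenv library's actual unescaping rules are not verified here.

```python
def escape_for_env(prompt: str) -> str:
    """Escape backslashes, double quotes and newlines, then wrap in double quotes."""
    escaped = prompt.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def unescape_from_env(value: str) -> str:
    """Inverse of escape_for_env (simplified; not dotenv's parser)."""
    inner = value[1:-1]  # drop the surrounding double quotes
    out, i = [], 0
    while i < len(inner):
        ch = inner[i]
        if ch == "\\" and i + 1 < len(inner):
            nxt = inner[i + 1]
            out.append("\n" if nxt == "n" else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

The order of the replacements matters: backslashes must be escaped first, otherwise the backslashes introduced for quotes and newlines would be doubled.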
