AI-Powered Document Classification with paperless-ai and Ollama

Author
Yang Hu

This post is a complete runbook for integrating AI-powered auto-tagging and classification into paperless-ngx using paperless-ai and a locally-running Ollama instance. The setup uses a local LLM to read document text and automatically populate metadata fields — title, document type, tags, correspondent, date, and custom fields.

Hardware and Architecture

  • NAS (Synology DS1621+, 10.0.10.10): runs paperless-ngx on port 5656
  • Desktop PC: Windows with WSL2, Docker Desktop, RTX 4090
  • Goal: AI auto-tagging/classification using a local LLM, zero cloud dependency

The key architecture decision is a pull model: paperless-ai runs in WSL2 Docker, polls the paperless-ngx API for documents tagged ai-pending, processes them with Ollama, and writes metadata back. This suits a desktop that is not on 24/7: the NAS holds the queue and the desktop drains it whenever it is available.

paperless-ngx (NAS)
       ↑  ↓  (REST API)
 paperless-ai (WSL2 Docker)
       ↑  ↓  (HTTP)
    Ollama (Windows native)
    RTX 4090 (GPU)

Ollama runs natively on Windows (not in WSL) for best GPU access. From inside a Docker container in WSL2, it is reachable via the special hostname host.docker.internal.

Prerequisites

  • paperless-ngx running and accessible via API
  • Docker Desktop installed on Windows with WSL2 integration enabled
  • Ollama installed on Windows

Step 1 — Set Up Ollama to Listen on All Interfaces

By default, Ollama only listens on 127.0.0.1, making it unreachable from WSL2 Docker containers. You must set a Windows system environment variable.

  1. Open System Properties → Advanced → Environment Variables
  2. Under System variables, click New
  3. Variable name: OLLAMA_HOST
  4. Variable value: 0.0.0.0
  5. Click OK, then restart Ollama (quit it from the tray icon and relaunch it)

Verify from WSL2:

curl http://$(ip route | awk '/default/ {print $3}'):11434/api/tags

From inside a Docker container, Ollama is reachable at host.docker.internal:11434.
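
If curl is not available where you are testing, the same reachability check can be sketched in Python. This is a minimal sketch rather than part of paperless-ai: the function name can_connect is made up for illustration, and it only confirms that a TCP connection to the port succeeds, not that Ollama answers API calls.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and hostnames that do not resolve
        return False


# Example calls (not executed here):
#   can_connect("host.docker.internal", 11434)   # from inside a container
#   can_connect("127.0.0.1", 11434)              # on the Windows host itself
```

A False result from inside the container while the Windows host returns True points at the bind address rather than at Ollama itself.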

Step 2 — Pull the Right Model

The model must support Ollama structured output (the format / JSON schema parameter). This uses constrained token-level decoding to enforce JSON output — not all models support it.

Critical: qwen3-vl:8b (the vision-language variant) does not support structured output. When you pass a format schema, Ollama silently returns an empty response string. This failure is silent and hard to diagnose.

Use qwen3:8b (the base model) instead:

# Run in PowerShell on Windows
ollama pull qwen3:8b

Test structured output works:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "format": {"type": "object", "properties": {"title": {"type": "string"}}, "required": ["title"]},
  "prompt": "Return a JSON object with a title field set to hello world.",
  "stream": false
}'

The response field should be a non-empty JSON string. If it is "", the model does not support structured output.
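
Because the failure is silent, it can help to check the reply programmatically instead of eyeballing it. The sketch below assumes the reply has already been parsed into a dict; the helper name structured_output_ok and the two sample replies are hand-written for illustration and are not captured Ollama output.

```python
import json


def structured_output_ok(api_reply: dict, required: list[str]) -> bool:
    """True if reply["response"] is non-empty JSON object text holding every required key."""
    raw = api_reply.get("response", "")
    if not isinstance(raw, str) or not raw.strip():
        return False  # the silent-failure case: an empty response string
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required)


# Hand-written sample replies, shaped like the test request above
good = {"model": "qwen3:8b", "response": '{"title": "hello world"}'}
bad = {"model": "qwen3-vl:8b", "response": ""}

print(structured_output_ok(good, ["title"]))
print(structured_output_ok(bad, ["title"]))
```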

Step 3 — Create Tags in paperless-ngx

Create two tags in paperless-ngx (Settings → Tags):

Tag | Purpose
--- | ---
ai-pending | Input filter: documents with this tag will be processed by paperless-ai
ai-processed | Output marker: paperless-ai adds this after successful processing

Set the matching algorithm for both tags to None (they are assigned by workflows and paperless-ai, not by auto-matching rules).

Note the tag IDs from the API (you will not need them explicitly, but they are useful for verification):

curl -s http://10.0.10.10:5656/api/tags/ \
  -H "Authorization: Token <YOUR_TOKEN>" | python3 -m json.tool | grep -A3 "ai-pending"
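
The grep output can be noisy, so here is a small sketch that pulls out just the two IDs from an already-parsed API reply. It assumes the tags endpoint returns a "results" list of objects with "id" and "name" keys; the helper name and the sample page are made up, and real IDs will differ.

```python
def tag_ids_by_name(page: dict, names: set[str]) -> dict[str, int]:
    """Map each requested tag name to its ID, using one page of tag list output."""
    return {
        tag["name"]: tag["id"]
        for tag in page.get("results", [])
        if tag["name"] in names
    }


# Hand-written sample page; real IDs will differ
sample_page = {
    "count": 3,
    "results": [
        {"id": 1, "name": "inbox"},
        {"id": 7, "name": "ai-pending"},
        {"id": 8, "name": "ai-processed"},
    ],
}

print(tag_ids_by_name(sample_page, {"ai-pending", "ai-processed"}))
```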

Step 4 — Create Workflows in paperless-ngx

paperless-ai never removes tags — it only adds them. The ai-pending tag must be removed after processing via a workflow. Set up two workflows in paperless-ngx (Settings → Workflows):

Workflow 1: “AI Processing Queue”

  • Trigger: Document Added
  • Action: Assign tag ai-pending

This ensures every newly added document enters the AI processing queue automatically.

Workflow 2: “Remove ai-pending after AI processed”

  • Trigger: Document Updated — has tag ai-processed
  • Action: Remove tag ai-pending

This cleans up the queue marker after paperless-ai finishes. Without this workflow, the ai-pending tag would stay on every document and each one would be sent to Ollama again on every scan.

Step 5 — Create the paperless-ai Project Files

Create a directory for the project:

mkdir -p ~/repo/paperless-ai
cd ~/repo/paperless-ai

docker-compose.yml

services:
  paperless-ai:
    image: clusterzx/paperless-ai
    container_name: paperless-ai
    restart: unless-stopped
    user: "0:0"
    env_file:
      - .env
    ports:
      - "3000:3000"
    volumes:
      - paperless-ai_data:/app/data

volumes:
  paperless-ai_data:

The user: "0:0" directive is essential. paperless-ai writes config and a SQLite database inside /app/data. With Docker Desktop on WSL2, permission mapping issues cause the node user (default) to be unable to create files in the volume — running as root eliminates these problems entirely.

.env

PAPERLESS_API_URL=http://10.0.10.10:5656/api
PAPERLESS_API_TOKEN=<YOUR_TOKEN>
PAPERLESS_USERNAME=yang
AI_PROVIDER=ollama
OLLAMA_API_URL=http://host.docker.internal:11434
OLLAMA_MODEL=qwen3:8b
SCAN_INTERVAL=*/30 * * * *
PROCESS_PREDEFINED_DOCUMENTS=yes
TAGS=ai-pending
ADD_AI_PROCESSED_TAG=yes
AI_PROCESSED_TAG_NAME=ai-processed
USE_EXISTING_DATA=yes

Key settings explained:

  • TAGS=ai-pending — paperless-ai only processes documents that have this tag
  • SCAN_INTERVAL=*/30 * * * * — poll paperless-ngx every 30 minutes
  • PROCESS_PREDEFINED_DOCUMENTS=yes — process documents that already exist (not just new ones)
  • ADD_AI_PROCESSED_TAG=yes — add ai-processed tag after processing (required for the cleanup workflow)
  • USE_EXISTING_DATA=yes — do not overwrite AI results with original empty fields
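
To make the cron syntax in SCAN_INTERVAL concrete, the sketch below expands only the minute field of the expression. It is a simplified illustration that handles three forms (a star, a star with a step, or a single number) and is not the scheduler paperless-ai uses.

```python
def minutes_matched(minute_field: str) -> list[int]:
    """Expand a cron minute field of the form '*', '*/N', or a single number."""
    if minute_field == "*":
        return list(range(60))
    if minute_field.startswith("*/"):
        step = int(minute_field[2:])
        return list(range(0, 60, step))
    return [int(minute_field)]


scan_interval = "*/30 * * * *"
# The first of the five fields is the minute field
print(minutes_matched(scan_interval.split()[0]))
```

With the value from the .env above, scans fire at minute 0 and minute 30 of every hour.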

Step 6 — Write the System Prompt

paperless-ai sends document text to Ollama together with your custom system prompt. The prompt is set via the web UI at http://localhost:3000 and stored as SYSTEM_PROMPT in /app/data/.env inside the container.

The prompt should define:

  • What document types exist (use consistent naming)
  • What topic tags are available
  • What custom fields to fill in
  • Explicit rules for edge cases

Key lessons from prompt engineering for this setup:

  1. Specify all valid values explicitly: do not let the model invent document types or tags
  2. Forbid reserved tags explicitly: if you have status tags managed by humans, list them as absolutely forbidden
  3. Require string types for custom fields: paperless-ai expects all custom field values as strings, so tell the model: “All custom field values must be strings (in quotes) or null. Write "2017.08", not 2017.08.”
  4. Give clear examples for ambiguous cases, e.g., “Medical bills → use type 发票收据 + tag #医疗, NOT type 医疗记录”

Example partial prompt structure:

You are a document classification assistant for a personal family archive.

## Document Types (choose exactly one)
- Invoice/Receipt (发票收据): bills, invoices, receipts
- Tax Document (税务文件): W-2, 1099, tax returns
- Immigration Document (移民文件): visas, passports, I-94
...

## Tags (assign all that apply from this list only)
- #insurance (#保险)
- #medical (#医疗)
- #financial (#财务)
...

## Custom Fields
- Amount: numeric amount as string, e.g. "2017.08" or null
- Bill Period: statement period end date, YYYY-MM-DD format or null
- Expiry Date: expiration date, YYYY-MM-DD format or null
- Document Year: year as string, e.g. "2024" or null
- Account / Policy Number: account or policy number as string or null

## Rules
- All custom field values must be strings (in quotes) or null
- Medical bills → type: 发票收据, tag: #医疗 — do NOT use type 医疗记录
- NEVER assign under any circumstances: #待处理 #重要 #归档 — these are
  reserved human-only status tags and must NEVER appear in your output

Step 7 — Start the Container

cd ~/repo/paperless-ai
docker compose up -d

Open the web UI at http://localhost:3000 to verify the configuration. The UI allows reviewing and editing settings, and triggering a manual scan.

Important: After you save settings in the web UI, the authoritative configuration is stored in /app/data/.env inside the Docker volume. The .env file next to docker-compose.yml only sets the initial environment variables; the UI writes its own config file, which takes precedence for some settings. If you edit the project .env and need the container to pick up the changes, use docker compose up -d (not docker compose restart, which does not re-read env files).

Troubleshooting

Docker daemon not running

Error response from daemon: dial unix /var/run/docker.sock: no such file or directory

Start Docker Desktop on Windows. Consider enabling “Start on login” in Docker Desktop settings.

Ollama not reachable from WSL2

connect ETIMEDOUT 10.255.255.254:11434

This means OLLAMA_HOST=0.0.0.0 is not set or Ollama was not restarted after setting it. Verify Ollama is listening:

# In PowerShell
netstat -ano | findstr 11434

The local address should show 0.0.0.0:11434, not 127.0.0.1:11434.

.env changes not picked up

docker compose restart does not re-read the env_file. Always use:

docker compose up -d

This recreates the container with the new environment.

Structured output returns empty response

No response data from Ollama API

The model does not support Ollama’s format parameter. Check which model is running:

curl http://localhost:11434/api/tags

Switch from any *-vl variant to the base model. Replace qwen3-vl:8b with qwen3:8b.

Custom field value type error

TypeError: customField.value?.trim is not a function

The AI returned a numeric value (e.g., 2017.08) where paperless-ai expected a string ("2017.08"). Add this rule to your system prompt: “All custom field values must be strings (in quotes) or null.”
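
When testing prompt changes, a quick check like the following can flag the offending fields before they reach paperless-ai. The list-of-dicts shape with field_name and value keys is an assumption made for this sketch, not the documented paperless-ai format, and the helper name is hypothetical.

```python
def non_string_values(custom_fields: list[dict]) -> list[str]:
    """Return the names of fields whose value is neither a string nor None."""
    return [
        field["field_name"]
        for field in custom_fields
        if field["value"] is not None and not isinstance(field["value"], str)
    ]


# Assumed shape, for illustration only
reply_fields = [
    {"field_name": "Amount", "value": 2017.08},        # number: the case behind the TypeError
    {"field_name": "Document Year", "value": "2024"},  # string: fine
    {"field_name": "Expiry Date", "value": None},      # null: fine
]

print(non_string_values(reply_fields))
```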

# in custom field name causes env parsing error

SyntaxError: Unterminated string in JSON at position 44

A custom field named something like Account / Policy # contains #, which is treated as a comment character in .env file parsing. Rename the field in paperless-ngx to avoid # — e.g., Account / Policy Number. Use the API to rename:

curl -X PATCH http://10.0.10.10:5656/api/custom_fields/5/ \
  -H "Authorization: Token <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"name": "Account / Policy Number"}'
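
To see why a # inside a value ends up as a JSON error, consider this deliberately simplified model of unquoted .env parsing. It is not the dotenv implementation (real parsers differ in the details), and the JSON shape of the sample line is hypothetical; the point is only that cutting a value at # leaves a string that never closes.

```python
import json


def naive_env_value(line: str) -> str:
    """Simplified model of unquoted .env parsing: cut the value at the first '#'."""
    _, _, value = line.partition("=")
    return value.split("#", 1)[0].strip()


# Hypothetical env line; the JSON layout is made up for illustration
line = 'CUSTOM_FIELDS={"custom_fields":[{"value":"Account / Policy #","data_type":"string"}]}'
truncated = naive_env_value(line)
print(truncated)

try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    print("invalid JSON:", err.msg)
```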

ai-pending tag not removed after processing

The tag stays on documents after AI processing. This means the cleanup workflow is not set up or not triggering. Verify:

  1. Workflow “Remove ai-pending after AI processed” exists in Settings → Workflows
  2. The trigger is: Document Updated, condition: has tag ai-processed
  3. The action is: Remove tag ai-pending

Remember: paperless-ai source code merges tags and never removes any. Removal requires the workflow.

sed corrupting unrelated env vars

If you use sed to edit /app/data/.env inside the container, be careful with substring matches. For example:

sed -i 's/CUSTOM_FIELDS=.*/CUSTOM_FIELDS=NEW_VALUE/' /app/data/.env

This will also match ACTIVATE_CUSTOM_FIELDS= because CUSTOM_FIELDS is a substring. Anchor the pattern to the start of the line (^CUSTOM_FIELDS= works in sed too), or use Python with an anchored pattern:

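python3 -c "
import re
from pathlib import Path

env_path = Path('/app/data/.env')
content = env_path.read_text()
# ^ with re.MULTILINE anchors to the start of a line, so ACTIVATE_CUSTOM_FIELDS= is left alone
content = re.sub(r'^CUSTOM_FIELDS=.*', 'CUSTOM_FIELDS=NEW_VALUE', content, flags=re.MULTILINE)
env_path.write_text(content)
"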

AI assigning forbidden status tags

The model occasionally assigns tags you have reserved for human use. Strengthen the prohibition in the prompt:

NEVER assign under any circumstances: #待处理 #重要 #归档 — these are
reserved human-only status tags and must NEVER appear in your output
under any circumstances, for any document type, regardless of content.
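
Since the prohibition is only a prompt instruction, it is worth auditing results now and then. The sketch below is a standalone check, separate from paperless-ai, that takes a list of tag names the AI assigned and reports any reserved ones; the helper name is made up for illustration.

```python
# Human-only status tags, as listed in the prompt
RESERVED_TAGS = {"#待处理", "#重要", "#归档"}


def reserved_tags_assigned(ai_tags: list[str]) -> set[str]:
    """Return the reserved tags that appear in a list of AI-assigned tag names."""
    return RESERVED_TAGS.intersection(ai_tags)


print(reserved_tags_assigned(["#医疗", "#重要"]))
print(reserved_tags_assigned(["#保险"]))
```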

How paperless-ai Works Internally

Understanding the internals helps when debugging:

  1. paperless-ai polls paperless-ngx API for documents with tag ai-pending
  2. For each document, it fetches the full text content
  3. It sends the text and system prompt to Ollama, passing the JSON schema in the format parameter
  4. Ollama uses constrained decoding (enforced at the token-sampling level) to produce valid JSON
  5. paperless-ai parses the response: title, document_type, tags, correspondent, document_date, language, custom_fields
  6. It calls paperlessService.updateDocument() which merges tags: [...new Set([...currentDoc.tags, ...updates.tags])] — it never removes tags
  7. It adds the ai-processed tag to signal completion
  8. The paperless-ngx workflow detects ai-processed and removes ai-pending
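
The tag handling in steps 6 to 8 can be modelled in a few lines. This is a sketch of the behaviour described above using plain Python sets, not the paperless-ai or paperless-ngx code; both function names are made up for illustration.

```python
def ai_update(current_tags: set[str], ai_tags: set[str]) -> set[str]:
    """Merge-only update: union of existing and new tags, plus the completion marker."""
    return current_tags | ai_tags | {"ai-processed"}


def cleanup_workflow(tags: set[str]) -> set[str]:
    """Model of Workflow 2: once ai-processed is present, drop ai-pending."""
    return tags - {"ai-pending"} if "ai-processed" in tags else tags


tags = {"ai-pending"}              # set by Workflow 1 when the document is added
tags = ai_update(tags, {"#医疗"})   # steps 6 and 7: nothing is removed here
print(sorted(tags))
tags = cleanup_workflow(tags)      # step 8: the workflow does the removal
print(sorted(tags))
```

The first print still contains ai-pending, which is exactly why the cleanup workflow is required.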

File Permissions Deep Dive

The user: "0:0" setting in docker-compose deserves explanation. paperless-ai’s base image runs as the node user. The named Docker volume’s root directory is owned by root:root with permissions 755. The node user can read and traverse the directory but cannot create new files in it (the application writes config atomically: create temp file, then rename — both require write permission to the directory). Running as root bypasses all of this.
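
The 755 reasoning can be checked with a simplified model of POSIX directory permissions. The sketch ignores supplementary groups and ACLs, and uid/gid 1000 for the node user is an assumption; creating a file needs both write and execute permission on the directory.

```python
import stat


def can_create_files(dir_mode: int, dir_uid: int, dir_gid: int, uid: int, gid: int) -> bool:
    """Simplified POSIX check: creating a file needs write+execute on the directory."""
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == dir_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif gid == dir_gid:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (dir_mode & need) == need


# Volume root owned by root:root (0:0) with mode 755; uid/gid 1000 for node is an assumption
print(can_create_files(0o755, 0, 0, uid=1000, gid=1000))
print(can_create_files(0o755, 0, 0, uid=0, gid=0))
```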

An alternative approach — switching to a bind mount — fails on WSL2/Docker Desktop because the uid/gid mapping between WSL2 and Windows causes SQLite to be unable to create database files.

Daily Usage

  • paperless-ai processes documents on startup and then every 30 minutes per SCAN_INTERVAL
  • Monitor and trigger manual scans at http://localhost:3000
  • The web UI shows processing history and current queue status
  • To bulk-remove tags in paperless-ngx: list view → select one document → “Select all X documents” appears → Actions → Edit Tags

Summary

Component | Location | Notes
--- | --- | ---
paperless-ngx | NAS 10.0.10.10:5656 | Document storage and API
paperless-ai | WSL2 Docker, port 3000 | Orchestrates AI processing
Ollama | Windows native, port 11434 | LLM inference with GPU
Model | qwen3:8b | Base model, not VL variant
Trigger tag | ai-pending | Added by paperless-ngx workflow
Completion tag | ai-processed | Added by paperless-ai

The entire pipeline is self-hosted, GPU-accelerated, and requires no cloud services. Documents are processed locally with full privacy.

Post-Setup Fixes

“Restrict to existing document types” setting does nothing (bug)

paperless-ai has a UI toggle to restrict document type assignment to existing types only. As of 2026-03, this is a confirmed bug (#834, #799): getOrCreateDocumentType() in services/paperlessService.js has no restriction guard, while getOrCreateCorrespondent() correctly implements it. A fix was submitted in PR #865 but closed as stale without merging.

Workaround: copy paperlessService.js out of the container, patch it, and bind-mount it back.

docker cp paperless-ai:/app/services/paperlessService.js ./paperlessService.js

In paperlessService.js, change the function signature and add the guard (mirror of how getOrCreateCorrespondent works):

// Before:
async getOrCreateDocumentType(name) {

// After:
async getOrCreateDocumentType(name, options = {}) {
  const restrictToExistingDocumentTypes = options.restrictToExistingDocumentTypes === true ||
      (options.restrictToExistingDocumentTypes === undefined &&
       process.env.RESTRICT_TO_EXISTING_DOCUMENT_TYPES === 'yes');

  // ... after the existingDocType search, before the create call:
  if (restrictToExistingDocumentTypes) {
      console.log(`[DEBUG] Document type "${name}" does not exist and restrictions are enabled, returning null`);
      return null;
  }

Then bind-mount the patched file in docker-compose.yml:

volumes:
  - paperless-ai_data:/app/data
  - ./paperlessService.js:/app/services/paperlessService.js:ro

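Recreate the container: docker compose up -d. The bind mount survives image updates, so the patched file keeps overriding whatever version ships in a newer image. After each update, check whether the upstream bug has been fixed, and remove the mount once it has.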
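
For readers who do not want to read the JavaScript, the control flow the guard is meant to add looks roughly like this. It is a Python sketch of the logic only: the dict stands in for the paperless-ngx document type list, the ID arithmetic stands in for the API create call, and the function name is made up.

```python
from typing import Optional


def resolve_document_type(name: str, existing: dict[str, int], restrict: bool) -> Optional[int]:
    """Return the ID of an existing type; create one only when restrictions are off."""
    if name in existing:
        return existing[name]
    if restrict:
        return None  # the guard: no creation while restrictions are enabled
    new_id = max(existing.values(), default=0) + 1  # stand-in for the API create call
    existing[name] = new_id
    return new_id


types = {"发票收据": 1, "税务文件": 2}
print(resolve_document_type("Invoice", types, restrict=True))
print(resolve_document_type("发票收据", types, restrict=True))
```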

AI setting correspondent to the string “null”

The model outputs "null" as a string when no correspondent is known, and paperless-ai then creates a correspondent literally named null. Fix: update the prompt to clarify that the correspondent field accepts either a name string or JSON null (not the word “null”). Also make the JSON template show null unquoted:

"correspondent": "Name or null",   ← template hint
...
## Correspondent: JSON null if unclear — never the string "null"

Clean up any null correspondents via API:

# Find and delete
curl -s http://NAS:5656/api/correspondents/ -H "Authorization: Token TOKEN" | \
  python3 -c "import sys,json; [print(c['id'], c['name']) for c in json.load(sys.stdin)['results']]"

curl -X DELETE http://NAS:5656/api/correspondents/ID/ -H "Authorization: Token TOKEN"
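
If the correspondent list is long, picking the IDs out by eye is error-prone. The sketch below filters an already-parsed page of the correspondents listing (the same "results" structure the one-liner above reads) for entries named null; matching case-insensitively is a choice made for this sketch, and the sample page is hand-written.

```python
def null_correspondent_ids(page: dict) -> list[int]:
    """IDs of correspondents whose name is the literal string 'null' (case-insensitive)."""
    return [
        c["id"]
        for c in page.get("results", [])
        if c["name"].strip().lower() == "null"
    ]


# Hand-written sample page; real IDs and names will differ
sample_page = {"results": [{"id": 3, "name": "Acme Utilities"}, {"id": 12, "name": "null"}]}
print(null_correspondent_ids(sample_page))
```

Each ID it returns can then be passed to the DELETE call shown above.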

Taxonomy design: document types

After running the system for a while, 发票收据 (Invoices & Receipts) became a catch-all for things that aren’t actual invoices — hotel booking confirmations, repair estimates, project quotes, and permit passes. The solution was to add two targeted types rather than keep forcing everything into a single bucket.

Added two new document types:

Type | Use case
--- | ---
行程预订 (Travel & Booking) | Hotel/flight confirmations, event tickets, passes, permits (SNO-PARK, etc.)
报价估价 (Quotes & Estimates) | Repair estimates, construction quotes, project proposals (anything not yet paid)

Deleted all zero-doc AI-created English types (Estimate, Invoice, Quote, repair_estimate, Travel Itinerary, Technical Manual, Product Manual, manual — ids 20–27).

Reclassified ~9 documents: hotel confirmations → 行程预订, airline itineraries → 行程预订, vehicle/home repair estimates → 报价估价, work authorization → 合同协议.

The updated prompt rules that were added to reduce misclassification:

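- Estimates/quotes (not yet paid) → "报价估价"; paid invoices → "发票收据"
- Hotel/flight confirmations → "行程预订" (not "发票收据")
- Construction authorization letters, renovation contracts → "合同协议" (not "发票收据")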

Updating the system prompt without the web UI

The paperless-ai web UI is the normal way to update the system prompt, but it’s inconvenient for iterative editing. The prompt is stored as SYSTEM_PROMPT in the container’s /app/data/.env (inside the Docker volume) using \n-escaped single-line format.

Caveat: dotenv v16 treats # after a \n sequence as a comment in unquoted values, so a prompt containing ## Section headers gets silently truncated. The fix is to wrap the value in double quotes when writing.

Workflow: keep the prompt in PROMPT.md alongside docker-compose.yml, and run a helper script to push it into the container:

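#!/usr/bin/env bash
# update-prompt.sh: push PROMPT.md into the container; run after editing PROMPT.md
set -euo pipefail

docker cp PROMPT.md paperless-ai:/tmp/PROMPT.md
docker exec paperless-ai node -e "
  const fs = require('fs'), dotenv = require('dotenv');
  const prompt = fs.readFileSync('/tmp/PROMPT.md', 'utf8').trimEnd();
  // Escape backslashes, double quotes, and newlines for a double-quoted single-line value
  const escaped = prompt.replace(/\\\\/g,'\\\\\\\\').replace(/\"/g,'\\\\\"').replace(/\n/g,'\\\\n');
  const env = fs.readFileSync('/app/data/.env','utf8');
  // A function replacement keeps dollar sequences in the prompt from being read as replace() patterns
  fs.writeFileSync('/app/data/.env', env.replace(/^SYSTEM_PROMPT=.*$/m, () => 'SYSTEM_PROMPT=\"'+escaped+'\"'));
  // Parse the file back as a sanity check on the stored length
  const val = dotenv.parse(fs.readFileSync('/app/data/.env')).SYSTEM_PROMPT;
  console.log('Updated:', val?.length, 'chars');
"
# Restart so paperless-ai loads the updated /app/data/.env
docker restart paperless-ai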

The script lives at ~/repo/paperless-ai/update-prompt.sh and is executable.
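
The escaping step is the part most likely to go wrong, so here it is in isolation. This Python sketch mirrors the three replacements in the helper script (backslashes first, then double quotes, then newlines); the function name to_env_line is made up, and it only builds the line rather than writing any file.

```python
def to_env_line(prompt: str, key: str = "SYSTEM_PROMPT") -> str:
    """Build KEY="..." with backslashes, double quotes, and newlines escaped."""
    escaped = (
        prompt.rstrip()
        .replace("\\", "\\\\")   # backslashes first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")    # real newlines become the two characters \ and n
    )
    return f'{key}="{escaped}"'


line = to_env_line('## Rules\n- Use "quotes"\n')
print(line)
```

The result is a single physical line wrapped in double quotes, so a ## header in the prompt is no longer at risk of being read as a comment.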