Paperless-ngx: Migrating a Decade of Documents from Google Drive

Runbook and design journal for migrating ~400 personal documents from a folder-based Google Drive system into Paperless-ngx on a Synology NAS. Covers taxonomy design, bulk import from Google Takeout, ML classifier setup, and ongoing intake workflow.

Problem Statement

For years my “document management” was a manually maintained folder tree on Google Drive:

10 - 文书材料/
  10 - 证件材料/身份证件/
  30 - 移民文档/
  30 - Tax Filing/
  40 - Finance/
  50 - 车辆注册/
  60 - 住房买房/
  80 - Medical/
20 - 家装住房信息/
80 - 旅行计划/

This worked well enough for filing but poorly for retrieval. Finding “what insurance forms did I have in 2022?” meant navigating six folders and guessing what I named things. Paperless-ngx offers full-text search, OCR, and an ML classifier that learns from your own labeling — a meaningfully better system for a document archive that spans immigration paperwork, tax filings, mortgage docs, and medical records across 10+ years.

Architecture

Google Takeout .zip
  │  migrate.py (classify + upload)
  ▼
Paperless-ngx (Docker, 10.0.10.10:5656)
  │  REST API  POST /api/documents/post_document/
  │  OCR + full-text index
  │  ML classifier (re-trains on labeled corpus)
  ▼
Synology NAS storage (/volume1/docker/paperless/)

The NAS runs Paperless-ngx in Docker via Container Manager. All storage (documents, database, Redis) is mounted to /volume1/docker/paperless/.

Taxonomy Design

Getting the taxonomy right before bulk import is important — it’s the training signal for the ML classifier. Wrong labels in 400 documents teach the wrong thing.

Document Types

The goal was types that are mutually exclusive and exhaustive for the documents I actually have. Chinese names throughout — the ML classifier learns from your labels regardless of language.

ID	Name	What goes here
1	发票收据	Invoices, receipts, paid orders (completed transactions only)
3	操作手册	Product manuals, user guides, assembly instructions
4	活动通告	Event invitations, announcements (non-booking)
5	参考资料	Reference material, price lists, brochures
6	设备信息	Equipment specs, device records, warranties
7	日程课表	Recurring schedules, class calendars, timetables
8	金融账单	Bank/investment/brokerage statements
9	税务文件	Tax returns, W-2, 1099, 1098, HSA
10	身份证件	Passports, driver’s license, ID cards
11	合同协议	Contracts, agreements, leases, work authorizations
12	医疗记录	Medical records, prescriptions, lab reports
13	证明证书	Certificates, proof letters, notarizations
14	移民文件	I-797, I-94, I-20, EAD, green card
15	签证申请	Visa applications (US, China, Canada, etc.)
16	工资单	Pay stubs
17	房产文件	Mortgage, permits, property tax, HOA
18	车辆文件	Car leases, DMV, registration
28	行程预订	Hotel/flight booking confirmations, event tickets, passes, permits (SNO-PARK, etc.)
29	报价估价	Repair estimates, construction quotes, project proposals (pre-payment)

Key design decisions:

No English names needed — Paperless ML learns from whatever you use
发票收据 = paid transactions only — Hotel confirmations go to 行程预订; repair estimates go to 报价估价; only completed-payment docs go here
行程预订 vs 发票收据 — A hotel confirmation is a booking, not a receipt; the receipt comes when you check out. Tickets and passes live here too
报价估价 vs 发票收据 — Unpaid estimates and quotes are pre-purchase; once paid and invoiced they become 发票收据
设备信息 ≠ 参考资料 — Equipment records (serial numbers, warranties) are structurally different from reference material (price lists, brochures)
活动通告 ≠ 日程课表 — Event notices are one-time announcements; schedules are recurring reference docs
金融账单 merges bank + investment — Both are periodic statements; the correspondent (HSBC vs Vanguard) distinguishes them if needed
税务文件 covers all tax docs — No need to split W-2 vs 1099 vs return at the type level; tags and correspondents carry that nuance
移民文件 vs 签证申请 — Maintained status documents (I-797, EAD) vs active application packages are genuinely different workflows
医疗账单 → 发票收据 + #医疗 tag — Medical bills are receipts; the tag provides the “medical” dimension without a redundant type

Tags

Tags handle cross-cutting dimensions that don’t belong in doc types:

Tag	Purpose
#保险	Insurance-related
#医疗	Medical topic
#教育	Education
#财务	Finance topic
#房产	Real estate
#旅行	Travel
#车辆	Vehicle
#移民	Immigration topic
#签证	Visa topic
#税务	Tax topic
#待处理	Inbox / needs review
#重要	Important, time-sensitive
#归档	Archived, no action needed

Year tags were considered and rejected. The initial plan was to put a year tag (#2016 through #2026) on every document. On reflection, year tags add noise to the ML training signal, and a “Document Year” custom field covers the use case more cleanly (filterable, sortable, and not cluttering the tag cloud).

Custom Fields

ID	Name	Type	Usage
1	Amount	float	Invoice/bill amounts
2	Bill Period	date	Statement period end date
3	到期日期	date	Expiry date (documents, visas)
4	Document Year	integer	Tax year, document year for historical docs
5	保单/账号	string	Policy number, account number

Correspondents

Set up for entities that appear frequently enough to be useful as filters:

ID	Name
4	Total Vision Campbell
5	IRS
6	California FTB
7	Google LLC
8	USCIS
9	US Dept of State
10	Vanguard
11	HSBC
12	County of Santa Clara

Pre-Import: Disable ML Matching

Before bulk importing 400 documents, disable Auto/ML matching on all document types, tags, and correspondents. If auto-matching is active during import, Paperless may attempt to re-classify documents using a half-trained model and overwrite your carefully assigned metadata.

# Disable matching on all document types
# (URLs are quoted so the shell does not try to glob the "?")
curl -s "http://10.0.10.10:5656/api/document_types/?page_size=50" \
  -H "Authorization: Token YOUR_TOKEN" | jq -r '.results[].id' | \
while read -r id; do
  curl -s -X PATCH "http://10.0.10.10:5656/api/document_types/$id/" \
    -H "Authorization: Token YOUR_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"matching_algorithm": 0}'
done

# Same for tags and correspondents
# matching_algorithm: 0=None, 1=Any, 2=All, 3=Literal, 4=Regex, 5=Fuzzy, 6=Auto/ML

Re-enable after import with "matching_algorithm": 6 for types, tags, and correspondents where you want ML suggestions.

Exception: #待处理 should stay at 0 (None) permanently. It’s a status tag applied by workflow, not a content category — the ML has no business guessing it.
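The curl loop above only covers document types. As a rough Python equivalent for all three endpoints, the sketch below walks each list and PATCHes `matching_algorithm`. The helper names (`set_matching`, `make_session`), the page size, and the `skip_ids` parameter are my own illustrative choices, not part of migrate.py. It also assumes the list endpoints paginate with a `next` link.

```python
API_BASE = "http://10.0.10.10:5656/api"   # address used throughout this post
MATCH_NONE, MATCH_AUTO = 0, 6              # matching_algorithm values listed above
KINDS = ("document_types", "tags", "correspondents")


def set_matching(session, kind, algorithm, skip_ids=()):
    """PATCH matching_algorithm on every object of one kind; return the IDs changed."""
    changed = []
    url = f"{API_BASE}/{kind}/?page_size=100"
    while url:                              # follow "next" links until the last page
        page = session.get(url, timeout=30)
        page.raise_for_status()
        body = page.json()
        for obj in body["results"]:
            if obj["id"] in skip_ids:       # e.g. leave the inbox tag untouched
                continue
            resp = session.patch(
                f"{API_BASE}/{kind}/{obj['id']}/",
                json={"matching_algorithm": algorithm},
                timeout=30,
            )
            resp.raise_for_status()
            changed.append(obj["id"])
        url = body.get("next")
    return changed


def make_session(token):
    """Hypothetical helper: a requests.Session carrying the API token."""
    import requests  # imported lazily so set_matching can be exercised with a stub session
    session = requests.Session()
    session.headers["Authorization"] = f"Token {token}"
    return session

# Before import:  for kind in KINDS: set_matching(session, kind, MATCH_NONE)
# After import:   set_matching(session, "tags", MATCH_AUTO, skip_ids={INBOX_TAG_ID})
```

`INBOX_TAG_ID` in the last comment stands for whatever ID #待处理 has in your instance. Passing it in `skip_ids` keeps that tag at 0 when everything else is switched back to Auto.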

Migration Script

migrate.py handles classification and upload in one pass. It reads directly from the Google Takeout .zip without extracting everything first.

Classification Logic

Each file is classified by its folder path using a priority-ordered chain of if checks:

import re
from pathlib import Path

# DOCTYPE, TAG, CORR (name -> ID maps) and the helpers result(),
# extract_year() and corr_from_filename() are defined elsewhere in migrate.py.

def classify(rel_path: str) -> dict:
    p = rel_path
    filename = Path(p).name
    fl = filename.lower()               # lowercased filename for the regex checks below
    year = extract_year(Path(p).parts)  # from any path component
    corr = corr_from_filename(filename) # keyword match on filename

    if "10 - 身份证件" in p:
        return result(DOCTYPE["身份证件"], [TAG["移民"]])

    if "30 - 移民文档" in p:
        return result(DOCTYPE["移民文件"], [TAG["移民"]])

    if "30 - Tax Filing" in p:
        tax_tags = [TAG["税务"]]
        if re.search(r'w[\-\s]?2\b', fl):
            # W-2s default to the employer when the filename names no correspondent
            return result(DOCTYPE["税务文件"], tax_tags, corr or CORR["Google LLC"])
        if re.search(r'(f?1099|f?1098|f?5498|1095)', fl):
            return result(DOCTYPE["税务文件"], tax_tags, corr)
        return result(DOCTYPE["税务文件"], tax_tags)

    # ... more rules ...

    return result(None, [TAG["待处理"]])  # fallback: inbox

Year is extracted from any path component matching \b(20[12]\d)\b — so 30 - Tax Filing/2022/W2.pdf gets year=2022 automatically.
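For illustration, a minimal sketch of what `extract_year` could look like, using the regex quoted above. The post does not show the real helper, so returning the first match and returning `None` when nothing matches are both assumptions.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

YEAR_RE = re.compile(r"\b(20[12]\d)\b")  # 2010-2029, delimited by word boundaries


def extract_year(parts: Iterable[str]) -> Optional[int]:
    """Return the first year found in any path component, or None."""
    for part in parts:
        match = YEAR_RE.search(part)
        if match:
            return int(match.group(1))
    return None


year = extract_year(Path("30 - Tax Filing/2022/W2.pdf").parts)  # 2022
```

One caveat with `\b`: an underscore counts as a word character, so a name like `W2_2021.pdf` does not match, while `2021-return.pdf` does.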

Correspondent is inferred from filename keywords first (google, hsbc, vanguard, irs, ftb). Folder-specific rules then refine it; in the W-2 rule above, for example, Google LLC is used only as a fallback when no keyword matches.
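A possible shape for the filename keyword lookup is sketched below. The IDs come from the Correspondents table. The keyword order and the letter-boundary check are my own assumptions, since the original `corr_from_filename` is not shown in the post.

```python
import re
from typing import Optional

# Correspondent IDs copied from the Correspondents table in this post
CORR = {"IRS": 5, "California FTB": 6, "Google LLC": 7, "Vanguard": 10, "HSBC": 11}

# (keyword, correspondent name); the first hit wins (assumed ordering)
FILENAME_KEYWORDS = [
    ("google", "Google LLC"),
    ("hsbc", "HSBC"),
    ("vanguard", "Vanguard"),
    ("irs", "IRS"),
    ("ftb", "California FTB"),
]


def corr_from_filename(filename: str) -> Optional[int]:
    """Return a correspondent ID if a keyword appears as a standalone token."""
    name = filename.lower()
    for keyword, corr_name in FILENAME_KEYWORDS:
        # Require non-letters on both sides so "irs" does not fire inside "first"
        if re.search(rf"(?<![a-z]){re.escape(keyword)}(?![a-z])", name):
            return CORR[corr_name]
    return None
```

A plain substring test would be shorter, but short keywords like `irs` then match unrelated names such as `first-page.pdf`.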

Upload

resp = requests.post(
    f"{API_BASE}/documents/post_document/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    files={"document": (filename, data)},
    data=form_data,   # list of (key, value) tuples for repeated fields
    timeout=120,
)

Tags must be sent as repeated form fields (not a JSON array):

for tag_id in meta["tags"]:
    form_data.append(("tags", tag_id))

Custom fields use JSON-encoded dict with string keys:

form_data.append(("custom_fields", json.dumps({str(YEAR_FIELD_ID): year})))
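Putting the pieces together, the form payload might be assembled roughly as follows. The field names (`title`, `document_type`, `correspondent`, `tags`, `custom_fields`) are the ones post_document accepts. The `meta` keys other than `tags` and the function name `build_form_data` are assumptions for this sketch.

```python
import json
from typing import Optional

YEAR_FIELD_ID = 4  # "Document Year" in the Custom Fields table


def build_form_data(meta: dict, title: str, year: Optional[int]) -> list:
    """Build the list of (key, value) tuples passed as data= to requests.post."""
    form_data = [("title", title)]
    if meta.get("document_type") is not None:
        form_data.append(("document_type", str(meta["document_type"])))
    if meta.get("correspondent") is not None:
        form_data.append(("correspondent", str(meta["correspondent"])))
    for tag_id in meta.get("tags", []):
        form_data.append(("tags", str(tag_id)))        # one repeated field per tag
    if year is not None:
        form_data.append(("custom_fields", json.dumps({str(YEAR_FIELD_ID): year})))
    return form_data
```

A list of tuples is used instead of a dict because a dict cannot hold the repeated `tags` key.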

File Type Filtering

Skip non-document files that made it into the Takeout:

SKIP_EXTENSIONS = {
    ".html", ".csv", ".qfx", ".gsheet", ".gdoc", ".gslides",
    ".gdraw", ".gmap", ".java", ".bin", ".db", ".exe", ".7z",
    ".rar", ".gshortcut", ".pptx", ".xls", ".xlsx", ".tar",
    ".doc", ".docx",  # Paperless can't OCR these reliably
}

.doc/.docx are technically supported by Paperless (Office formats require its optional Tika and Gotenberg services), but they are unreliable for OCR; export to PDF first if you care about full-text search on those.
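Reading straight from the Takeout archive and applying the extension filter can be done with the standard `zipfile` module. This is a sketch rather than the actual migrate.py loop: `iter_documents` is a hypothetical name, and the extension set is abbreviated.

```python
import zipfile
from pathlib import PurePosixPath

SKIP_EXTENSIONS = {".html", ".csv", ".doc", ".docx"}  # abbreviated; full set is above


def iter_documents(zip_source):
    """Yield (path inside the archive, file bytes) for each importable member."""
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Zip member names always use forward slashes, hence PurePosixPath
            if PurePosixPath(info.filename).suffix.lower() in SKIP_EXTENSIONS:
                continue
            yield info.filename, zf.read(info)
```

Each member is read into memory one at a time, so nothing is extracted to disk. `zip_source` can be a path or an open binary file object.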

Dry Run

python3 migrate.py --dry-run

Output:

=================================================================
  IMPORT PREVIEW — 358 files to import
=================================================================

By document type:
  税务文件        89 files
  移民文件        72 files
  金融账单        61 files
  ...

Skipped (non-document files): 47
  ext:.html                       18
  ext:.csv                        12
  ...

Sample mappings (first 30):
  10 - 文书材料/30 - Tax Filing/2022/W2-Google.pdf
    → 税务文件 | tags:['税务'] | corr:Google LLC | year:2022
    title: W2-Google

Execute

python3 migrate.py --execute --yes

The --yes flag skips the confirmation prompt, which is necessary when running over a non-interactive SSH session where input() would otherwise hang. Paperless deduplicates by content hash, so re-running is safe.
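The flag handling could be wired up along these lines. It is only a sketch under my own assumptions: the mutually exclusive group, the prompt wording, and the `confirmed` helper are not taken from migrate.py.

```python
import argparse
import sys


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Import a Takeout archive into Paperless-ngx")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--dry-run", action="store_true", help="print the import preview only")
    mode.add_argument("--execute", action="store_true", help="upload documents")
    parser.add_argument("--yes", action="store_true", help="skip the confirmation prompt")
    return parser.parse_args(argv)


def confirmed(args, count, stdin=sys.stdin):
    """Decide whether an --execute run may proceed."""
    if args.yes:
        return True
    if not stdin.isatty():
        # No terminal attached: refuse instead of blocking on input()
        return False
    return input(f"Upload {count} files? [y/N] ").strip().lower() == "y"
```

Checking `isatty()` makes a forgotten `--yes` fail fast in a non-interactive session instead of hanging.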

Results

358 documents imported across 17 document types (19 after a later taxonomy expansion; the Document Types table above lists all 19). Upload took ~25 minutes including OCR processing time. Zero errors after fixing the .doc/.docx exclusions.

ML Classifier Training

After all documents were OCR-processed and the #待处理 tag was bulk-removed, I triggered classifier training manually:

# Run from NAS with docker permissions
/usr/local/bin/docker exec paperless-webserver-1 \
  python3 manage.py document_create_classifier

With ~400 labeled documents, the classifier has a solid training set for the most common document types (税务文件, 移民文件, 金融账单 each have 50-90 examples). Rarer types like 操作手册 or 活动通告 will improve as more documents are added.
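To see which types are still thin on examples, something like the following can be run over the `results` list from `/api/documents/`, where each entry carries a `document_type` ID or null. The function name and the threshold of 10 are arbitrary choices for this sketch, not values from the post.

```python
from collections import Counter


def sparse_types(documents, type_names, min_examples=10):
    """Return {type name: count} for types with fewer than min_examples documents."""
    counts = Counter(
        d["document_type"] for d in documents if d.get("document_type") is not None
    )
    return {
        name: counts.get(type_id, 0)
        for type_id, name in type_names.items()
        if counts.get(type_id, 0) < min_examples
    }
```

`type_names` maps type ID to name, for example `{3: "操作手册", 4: "活动通告", 9: "税务文件"}` taken from the Document Types table.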

Wait until OCR is complete before training. The classifier trains on the OCR’d content, not the raw file. Running too early means training on empty or partial text.
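One way to know when OCR has finished is to poll the consumption tasks. The sketch below assumes you have a task ID per upload; to my knowledge recent Paperless-ngx releases return one from post_document. It also assumes a caller-supplied `fetch_status` function that looks the task up (for instance via the tasks endpoint) and returns its status string. Both the helper and the status names are assumptions to verify against your version.

```python
import time


def wait_for_tasks(fetch_status, task_ids, poll_seconds=5.0, timeout=1800.0,
                   sleep=time.sleep):
    """Block until every task has finished; return {task_id: final status}."""
    pending = set(task_ids)
    done = {}
    deadline = time.monotonic() + timeout
    while pending:
        for task_id in list(pending):
            status = fetch_status(task_id)       # hypothetical lookup callable
            if status in ("SUCCESS", "FAILURE"):  # assumed terminal states
                done[task_id] = status
                pending.discard(task_id)
        if not pending:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"{len(pending)} tasks still running")
        sleep(poll_seconds)
    return done
```

Only run document_create_classifier once this returns, and check the `FAILURE` entries first, since those documents never reached the index.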

Inbox Workflow

New documents (uploaded manually, scanned via mobile app, or imported via email) automatically get tagged #待处理:

Settings → Workflows → Add Workflow:

Name: New Document Inbox
Trigger: Document Added (type 2)
Action: Assign Tag #待处理 (action type 1)

This gives you a reliable inbox view without relying on ML guessing. The workflow fires on every new document regardless of source.

#待处理 has matching_algorithm: 0 (None) — it is never assigned by ML, only by workflow. This keeps it clean as a status signal.

Clear the inbox by removing #待处理 after reviewing a document.

Ongoing Usage Patterns

Intake sources

Source	Method
Paper documents	Scan with mobile app (e.g. Scanner Pro), upload to Paperless
Email attachments	Paperless email inbox (IMAP polling configured separately)
Downloaded PDFs	Drag-drop to Paperless UI or consume folder
Consume folder	SMB share from NAS, accessible from Windows/Mac

Review workflow

Open saved search: tag:#待处理
For each document: verify type, add any missing tags, fix title if needed
Remove #待处理 — document moves out of inbox
Add #重要 for anything time-sensitive or expiring soon

Finding documents

Full-text search handles most cases (e.g. “EAD renewal 2023”)
Filter by document type + correspondent for statements (金融账单 + Vanguard)
Filter by correspondent + Document Year custom field for tax docs
Tag #归档 on anything fully processed and unlikely to need action

Credential backup

Store Paperless admin password and API token in your password manager (Bitwarden). The API token lives in Settings → Tokens; regenerate and update any scripts if rotated.

Debugging Notes

待处理 appearing on all imported docs

If you ran the ML classifier before disabling Auto/ML on #待处理, it may have learned to tag everything. Bulk-remove via API:

import requests

API_BASE  = "http://10.0.10.10:5656/api"
API_TOKEN = "YOUR_TOKEN"
TAG_ID    = 7  # #待处理

headers = {"Authorization": f"Token {API_TOKEN}",
           "Content-Type": "application/json"}

# Get all docs with this tag
r = requests.get(f"{API_BASE}/documents/?tags__id__all={TAG_ID}&page_size=500",
                 headers=headers)
doc_ids = [d["id"] for d in r.json()["results"]]
print(f"Total docs to clean: {len(doc_ids)}")

# Bulk remove tag
r = requests.post(f"{API_BASE}/documents/bulk_edit/",
    headers=headers,
    json={"documents": doc_ids,
          "method": "remove_tag",
          "parameters": {"tag": TAG_ID}})
print(r.status_code, r.text)

Document type auto-assigned incorrectly after import

ML may assign wrong types if it trained on an imbalanced or noisy corpus. Check the document, correct manually, then retrain:

/usr/local/bin/docker exec paperless-webserver-1 \
  python3 manage.py document_create_classifier

Each manual correction feeds back into the training set.

custom_fields format error

Paperless expects custom fields as a JSON-encoded dict with string keys:

# ✓ Correct
json.dumps({"4": 2024})

# ✗ Wrong — causes 400 error
json.dumps([{"field": 4, "value": 2024}])

Port conflicts

If Paperless is unreachable, check that no other service is using port 5656 on the NAS. DSM firewall may also block it from certain subnets — check Control Panel → Security → Firewall.
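A quick TCP probe from the client machine helps tell the two cases apart. This is a generic socket check, not anything Paperless-specific, and the function name `probe` is my own.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or an error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host answered, but nothing is listening on that port
    except socket.timeout:
        return "timeout"   # no answer at all, often a firewall dropping packets
    except OSError as exc:
        return f"error: {exc}"

# Example: probe("10.0.10.10", 5656)
```

As a rule of thumb, `refused` points at the container or port mapping, while `timeout` from one subnet but not another points at the DSM firewall.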

My Setup

Item	Value
NAS	Synology DS1621+
NAS IP	10.0.10.10
Paperless port	5656
Docker image	`ghcr.io/paperless-ngx/paperless-ngx:latest`
Documents imported	358
Import source	Google Takeout (single .zip, ~2 GB)
Languages	Chinese + English (OCR configured for both)
Training corpus	~400 labeled docs across 19 types

Notes

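The post_document endpoint is async: Paperless queues the document for OCR and returns immediately with HTTP 200 (in current releases the response body is the UUID of the consumption task, not the finished document). The document won’t be searchable until OCR completes (usually seconds to a minute per doc, depending on NAS load)
Paperless deduplicates by a checksum of the file content (MD5 in current releases, not SHA256), so it is safe to re-run the import script; a duplicate is rejected by the consumer and no second copy is created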
Google Takeout exports Google Docs/Sheets as their native format by default; to get PDFs, use the “Export format: PDF” option when creating the Takeout
The document_create_classifier command prints No updates since last training if the training set hasn’t changed since the last run — this is normal, not an error
