Runbook and design journal for migrating ~400 personal documents from a folder-based Google Drive system into Paperless-ngx on a Synology NAS. Covers taxonomy design, bulk import from Google Takeout, ML classifier setup, and ongoing intake workflow.
Problem Statement#
For years my “document management” was a manually maintained folder tree on Google Drive:
| |
This worked well enough for filing but poorly for retrieval. Finding “what insurance forms did I have in 2022?” meant navigating six folders and guessing what I named things. Paperless-ngx offers full-text search, OCR, and an ML classifier that learns from your own labeling — a meaningfully better system for a document archive that spans immigration paperwork, tax filings, mortgage docs, and medical records across 10+ years.
Architecture#
| |
The NAS runs Paperless-ngx in Docker via Container Manager. All storage
(documents, database, Redis) is mounted to /volume1/docker/paperless/.
Taxonomy Design#
Getting the taxonomy right before bulk import is important — it’s the training signal for the ML classifier. Wrong labels in 400 documents teach the wrong thing.
Document Types#
The goal was types that are mutually exclusive and exhaustive for the documents I actually have. Chinese names throughout — the ML classifier learns from your labels regardless of language.
| ID | Name | What goes here |
|---|---|---|
| 1 | 发票收据 | Invoices, receipts, paid orders (completed transactions only) |
| 3 | 操作手册 | Product manuals, user guides, assembly instructions |
| 4 | 活动通告 | Event invitations, announcements (non-booking) |
| 5 | 参考资料 | Reference material, price lists, brochures |
| 6 | 设备信息 | Equipment specs, device records, warranties |
| 7 | 日程课表 | Recurring schedules, class calendars, timetables |
| 8 | 金融账单 | Bank/investment/brokerage statements |
| 9 | 税务文件 | Tax returns, W-2, 1099, 1098, HSA |
| 10 | 身份证件 | Passports, driver’s license, ID cards |
| 11 | 合同协议 | Contracts, agreements, leases, work authorizations |
| 12 | 医疗记录 | Medical records, prescriptions, lab reports |
| 13 | 证明证书 | Certificates, proof letters, notarizations |
| 14 | 移民文件 | I-797, I-94, I-20, EAD, green card |
| 15 | 签证申请 | Visa applications (US, China, Canada, etc.) |
| 16 | 工资单 | Pay stubs |
| 17 | 房产文件 | Mortgage, permits, property tax, HOA |
| 18 | 车辆文件 | Car leases, DMV, registration |
| 28 | 行程预订 | Hotel/flight booking confirmations, event tickets, passes, permits (SNO-PARK, etc.) |
| 29 | 报价估价 | Repair estimates, construction quotes, project proposals (pre-payment) |
Key design decisions:
- No English names needed — Paperless ML learns from whatever you use
- 发票收据 = paid transactions only — Hotel confirmations go to 行程预订; repair estimates go to 报价估价; only completed-payment docs go here
- 行程预订 vs 发票收据 — A hotel confirmation is a booking, not a receipt; the receipt comes when you check out. Tickets and passes live here too
- 报价估价 vs 发票收据 — Unpaid estimates and quotes are pre-purchase; once paid and invoiced they become 发票收据
- 设备信息 ≠ 参考资料 — Equipment records (serial numbers, warranties) are structurally different from reference material (price lists, brochures)
- 活动通告 ≠ 日程课表 — Event notices are one-time announcements; schedules are recurring reference docs
- 金融账单 merges bank + investment — Both are periodic statements; the correspondent (HSBC vs Vanguard) distinguishes them if needed
- 税务文件 covers all tax docs — No need to split W-2 vs 1099 vs return at the type level; tags and correspondents carry that nuance
- 移民文件 vs 签证申请 — Maintained status documents (I-797, EAD) vs active application packages are genuinely different workflows
- 医疗账单 → 发票收据 + #医疗 tag — Medical bills are receipts; the tag provides the “medical” dimension without a redundant type
Tags#
Tags handle cross-cutting dimensions that don’t belong in doc types:
| Tag | Purpose |
|---|---|
| #保险 | Insurance-related |
| #医疗 | Medical topic |
| #教育 | Education |
| #财务 | Finance topic |
| #房产 | Real estate |
| #旅行 | Travel |
| #车辆 | Vehicle |
| #移民 | Immigration topic |
| #签证 | Visa topic |
| #税务 | Tax topic |
| #待处理 | Inbox / needs review |
| #重要 | Important, time-sensitive |
| #归档 | Archived, no action needed |
Year tags were considered and rejected. The initial plan put #2016 through #2026 on every document. On reflection, year tags add noise to the ML training signal, and a “Document Year” custom field covers the use case more cleanly (filterable, sortable, and it keeps the tag cloud uncluttered).
Custom Fields#
| ID | Name | Type | Usage |
|---|---|---|---|
| 1 | Amount | float | Invoice/bill amounts |
| 2 | Bill Period | date | Statement period end date |
| 3 | 到期日期 | date | Expiry date (documents, visas) |
| 4 | Document Year | integer | Tax year, document year for historical docs |
| 5 | 保单/账号 | string | Policy number, account number |
Correspondents#
Set up for entities that appear frequently enough to be useful as filters:
| ID | Name |
|---|---|
| 4 | Total Vision Campbell |
| 5 | IRS |
| 6 | California FTB |
| 7 | Google LLC |
| 8 | USCIS |
| 9 | US Dept of State |
| 10 | Vanguard |
| 11 | HSBC |
| 12 | County of Santa Clara |
Pre-Import: Disable ML Matching#
Before bulk importing 400 documents, disable Auto/ML matching on all document types, tags, and correspondents. If auto-matching is active during import, Paperless may attempt to re-classify documents using a half-trained model and overwrite your carefully assigned metadata.
| |
Re-enable after import with "matching_algorithm": 6 for types, tags, and
correspondents where you want ML suggestions.
Exception: #待处理 should stay at 0 (None) permanently. It’s a status tag applied by workflow, not a content category — the ML has no business guessing it.
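A minimal sketch of how the toggle could be scripted, assuming the standard REST endpoints (`/api/document_types/`, `/api/tags/`, `/api/correspondents/`) and token authentication. The helper names and the idea of passing IDs in a dict are illustrative, not part of the original script. The block only builds the requests; sending them is left to the caller.

```python
import json
import urllib.request

MATCH_NONE = 0  # "None" in the UI
MATCH_AUTO = 6  # "Auto" (ML) in the UI


def build_matching_patch(base_url, token, endpoint, object_id, algorithm):
    """Build a PATCH request that sets matching_algorithm on one object."""
    url = f"{base_url.rstrip('/')}/api/{endpoint}/{object_id}/"
    body = json.dumps({"matching_algorithm": algorithm}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )


def build_all_patches(base_url, token, ids_by_endpoint, algorithm):
    """One request per object, e.g. {"document_types": [1, 3], "tags": [2]}."""
    return [
        build_matching_patch(base_url, token, endpoint, object_id, algorithm)
        for endpoint, ids in ids_by_endpoint.items()
        for object_id in ids
    ]


# To send: urllib.request.urlopen(request) for each request in the list.
```

Calling `build_all_patches(..., MATCH_NONE)` before the import and again with `MATCH_AUTO` afterwards (leaving the #待处理 tag ID out of the second call) mirrors the two steps described above.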
Migration Script#
migrate.py handles classification and upload in one pass. It reads
directly from the Google Takeout .zip without extracting everything first.
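Reading straight from the archive can be sketched with the standard-library `zipfile` module. The function name below is hypothetical and this is not the original `migrate.py` code, just one way to walk the entries without unpacking to disk.

```python
import zipfile


def iter_takeout_files(zip_source):
    """Yield (path_inside_zip, file_bytes) for every regular file.

    zip_source may be a filesystem path or a file-like object.
    Each member is read fully into memory, which is fine for
    individual documents but not for very large single files.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip folder entries, keep only files
            yield info.filename, archive.read(info)
```

The path inside the zip is what the classification step works from, so nothing needs to exist on the NAS filesystem before upload.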
Classification Logic#
Each file is classified by its folder path using a priority-ordered chain
of if checks:
| |
Year is extracted from any path component matching \b(20[12]\d)\b —
so 30 - Tax Filing/2022/W2.pdf gets year=2022 automatically.
Correspondent is inferred from filename keywords first (google, hsbc,
vanguard, irs, ftb), then overridden by folder-specific rules.
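The three rules above can be sketched roughly as follows. Only the `30 - Tax Filing` folder, the year regex, and the five filename keywords come from the text; the other folder keywords, the fallback type, and the USCIS override are hypothetical stand-ins for the real rules.

```python
import re
from pathlib import PurePosixPath

YEAR_RE = re.compile(r"\b(20[12]\d)\b")

# Filename keyword -> correspondent name (keywords from the text above).
FILENAME_KEYWORDS = {
    "google": "Google LLC",
    "hsbc": "HSBC",
    "vanguard": "Vanguard",
    "irs": "IRS",
    "ftb": "California FTB",
}


def classify_type(path):
    """Priority-ordered if chain: the first matching rule wins."""
    lowered = path.lower()
    if "tax filing" in lowered:
        return "税务文件"
    if "visa" in lowered:          # hypothetical rule, checked before immigration
        return "签证申请"
    if "immigration" in lowered:   # hypothetical rule
        return "移民文件"
    return "参考资料"               # hypothetical fallback


def extract_year(path):
    """Return the first 2010-2029 year found in any path component."""
    for part in PurePosixPath(path).parts:
        match = YEAR_RE.search(part)
        if match:
            return int(match.group(1))
    return None


def infer_correspondent(path):
    """Filename keywords first, then folder-specific overrides."""
    name = PurePosixPath(path).name.lower()
    correspondent = None
    for keyword, who in FILENAME_KEYWORDS.items():
        if keyword in name:
            correspondent = who
            break
    if "immigration" in path.lower():  # hypothetical folder override
        correspondent = "USCIS"
    return correspondent


def classify(path):
    return {
        "document_type": classify_type(path),
        "year": extract_year(path),
        "correspondent": infer_correspondent(path),
    }
```

One caveat with plain substring matching: a keyword as short as `irs` also matches filenames like `first_page.pdf`, so a dry run is worth reading closely before uploading.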
Upload#
| |
Tags must be sent as repeated form fields (not a JSON array):
| |
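As an illustration of "repeated form fields", the sketch below builds the non-file part of the upload as a list of `(name, value)` pairs, with one `tags` entry per tag. The helper name and the example IDs are hypothetical; the field names (`title`, `document_type`, `correspondent`, `tags`) are the ones the `post_document` endpoint accepts.

```python
def build_form_fields(title, document_type_id, tag_ids, correspondent_id=None):
    """Return form fields as a list of pairs so that 'tags' can repeat."""
    fields = [
        ("title", title),
        ("document_type", str(document_type_id)),
    ]
    if correspondent_id is not None:
        fields.append(("correspondent", str(correspondent_id)))
    for tag_id in tag_ids:
        # One ("tags", id) pair per tag, NOT a single JSON array value.
        fields.append(("tags", str(tag_id)))
    return fields


# With the requests library, a list of pairs can be passed directly:
#   requests.post(url, data=fields, files={"document": (name, file_bytes)},
#                 headers={"Authorization": f"Token {token}"})
```

A dict would not work here because it can only hold one value per key, which is why the list of pairs is used.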
Custom fields use a JSON-encoded dict with string keys:
| |
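A small sketch of that encoding, using the field IDs from the Custom Fields table (1 = Amount, 4 = Document Year). The helper name is hypothetical, and whether a given Paperless version accepts this mapping form is an assumption worth checking against its API docs.

```python
import json


def encode_custom_fields(values_by_field_id):
    """Encode {field_id: value} as a JSON object with string keys."""
    return json.dumps({str(field_id): value
                       for field_id, value in values_by_field_id.items()})


# The result is sent as the value of the "custom_fields" form field.
```

`json.dumps` would stringify integer keys on its own, but converting explicitly makes the intent visible and keeps the payload predictable if the dict is built elsewhere.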
File Type Filtering#
Skip non-document files that made it into the Takeout:
| |
.doc/.docx are technically supported by Paperless but unreliable for
OCR; export to PDF first if you care about full-text search on those.
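The filter can be as simple as an extension allow-list. The exact set below is an assumption (the original list is not shown here); the only part taken from the text is that `.doc`/`.docx` are deliberately left out.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list; .doc/.docx are intentionally absent.
ALLOWED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def should_upload(path):
    """True if the file's extension is on the allow-list (case-insensitive)."""
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in ALLOWED_SUFFIXES
```

An allow-list fails closed: anything unexpected in the Takeout (shortcuts, metadata JSON, hidden files) is skipped without needing its own rule.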
Dry Run#
| |
Output:
| |
Execute#
| |
The --yes flag skips the confirmation prompt (necessary when running
non-interactively over SSH, where input() would block waiting for a reply).
Paperless deduplicates by content hash, so re-running is safe.
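A sketch of how the prompt and the `--yes` bypass might be wired up with `argparse`. The `--dry-run` flag name, the prompt wording, and the function names are assumptions for illustration; only `--yes` is taken from the script described above.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Import a Takeout archive into Paperless-ngx")
    parser.add_argument("--dry-run", action="store_true",
                        help="classify and print, but upload nothing")
    parser.add_argument("--yes", action="store_true",
                        help="skip the confirmation prompt")
    return parser.parse_args(argv)


def confirmed(args, file_count, ask=input):
    """Return True if the upload should proceed."""
    if args.yes:
        return True  # never touches stdin, so it cannot block
    answer = ask(f"Upload {file_count} files? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Passing `ask` as a parameter keeps the prompt testable without a terminal.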
Results#
358 documents imported across 17 document types (19 after the taxonomy expansion; the Document Types table above lists all 19). Upload took ~25 minutes
including OCR processing time. Zero errors after fixing the .doc/.docx
exclusions.
ML Classifier Training#
After all documents were OCR-processed and the #待处理 tag bulk-removed:
| |
With ~400 labeled documents, the classifier has a solid training set for the most common document types (税务文件, 移民文件, 金融账单 each have 50-90 examples). Rarer types like 操作手册 or 活动通告 will improve as more documents are added.
Wait until OCR is complete before training. The classifier trains on the OCR’d content, not the raw file. Running too early means training on empty or partial text.
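One way to enforce the "wait for OCR" rule is to poll the task list until nothing is still in flight. This is a sketch under assumptions: `fetch_tasks` is a caller-supplied function that returns the parsed JSON list from `GET /api/tasks/`, and the status strings are assumed to be the usual Celery states.

```python
import time

# Assumed Celery-style states that mean "not finished yet".
UNFINISHED = {"PENDING", "STARTED", "RETRY"}


def unfinished_tasks(tasks):
    """Filter a task list down to entries that are still running or queued."""
    return [task for task in tasks if task.get("status") in UNFINISHED]


def wait_until_idle(fetch_tasks, poll_seconds=10, max_polls=360, sleep=time.sleep):
    """Poll until no task is unfinished; return False if max_polls is reached."""
    for _ in range(max_polls):
        if not unfinished_tasks(fetch_tasks()):
            return True
        sleep(poll_seconds)
    return False
```

Only once this returns `True` does it make sense to run `document_create_classifier`.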
Inbox Workflow#
New documents (uploaded manually, scanned via mobile app, or imported via email) automatically get tagged #待处理:
Settings → Workflows → Add Workflow:
- Name: New Document Inbox
- Trigger: Document Added (type 2)
- Action: Assign Tag #待处理 (action type 1)
This gives you a reliable inbox view without relying on ML guessing. The workflow fires on every new document regardless of source.
#待处理 has matching_algorithm: 0 (None) — it is never assigned by
ML, only by workflow. This keeps it clean as a status signal.
Clear the inbox by removing #待处理 after reviewing a document.
Ongoing Usage Patterns#
Intake sources#
| Source | Method |
|---|---|
| Paper documents | Scan with mobile app (e.g. Scanner Pro), upload to Paperless |
| Email attachments | Paperless email inbox (IMAP polling configured separately) |
| Downloaded PDFs | Drag-drop to Paperless UI or consume folder |
| Consume folder | SMB share from NAS, accessible from Windows/Mac |
Review workflow#
- Open saved search: tag:#待处理
- For each document: verify type, add any missing tags, fix title if needed
- Remove #待处理 — document moves out of inbox
- Add #重要 for anything time-sensitive or expiring soon
Finding documents#
- Full-text search handles most cases (e.g. “EAD renewal 2023”)
- Filter by document type + correspondent for statements (金融账单 + Vanguard)
- Filter by correspondent + Document Year custom field for tax docs
- Tag #归档 on anything fully processed and unlikely to need action
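The same lookups can be scripted against the documents endpoint. The sketch below only builds the URL, assuming the `query`, `document_type__id`, and `correspondent__id` filter parameters; the function name is hypothetical, and the IDs in the usage note are the ones from the tables above.

```python
from urllib.parse import urlencode


def documents_url(base_url, query=None, document_type_id=None, correspondent_id=None):
    """Build a /api/documents/ URL with optional full-text and ID filters."""
    params = {}
    if query:
        params["query"] = query
    if document_type_id is not None:
        params["document_type__id"] = document_type_id
    if correspondent_id is not None:
        params["correspondent__id"] = correspondent_id
    url = f"{base_url.rstrip('/')}/api/documents/"
    return f"{url}?{urlencode(params)}" if params else url
```

For example, `documents_url("http://10.0.10.10:5656", document_type_id=8, correspondent_id=10)` corresponds to the 金融账单 + Vanguard filter.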
Credential backup#
Store the Paperless admin password and API token in your password manager (Bitwarden). The API token lives in Settings → Tokens; if you rotate it, update every script that uses it.
Debugging Notes#
#待处理 appearing on all imported docs#
If you ran the ML classifier before disabling Auto/ML on #待处理, it may have learned to tag everything. Bulk-remove via API:
| |
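A sketch of the bulk removal, assuming the `bulk_edit` endpoint (`POST /api/documents/bulk_edit/`) with its `remove_tag` method. The batching and the helper name are illustrative choices, and the tag ID must be looked up in your own instance.

```python
def bulk_remove_tag_payloads(document_ids, tag_id, batch_size=100):
    """Split document IDs into batches and build one JSON payload per batch."""
    payloads = []
    for start in range(0, len(document_ids), batch_size):
        batch = document_ids[start:start + batch_size]
        payloads.append({
            "documents": batch,
            "method": "remove_tag",
            "parameters": {"tag": tag_id},
        })
    return payloads


# Each payload is POSTed as JSON to /api/documents/bulk_edit/
# with the usual "Authorization: Token ..." header.
```

Batching is optional, but it keeps each request small and makes a partial failure easier to retry.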
Document type auto-assigned incorrectly after import#
ML may assign wrong types if it trained on an imbalanced or noisy corpus. Check the document, correct manually, then retrain:
| |
Each manual correction feeds back into the training set.
custom_fields format error#
Paperless expects custom fields as a JSON-encoded dict with string keys:
| |
Port conflicts#
If Paperless is unreachable, check that no other service is using port 5656 on the NAS. DSM firewall may also block it from certain subnets — check Control Panel → Security → Firewall.
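A quick reachability check from a client machine can narrow this down. This is a generic TCP connect test (the function name is hypothetical), using the host and port from the setup table below as the intended arguments.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example: port_is_open("10.0.10.10", 5656)
```

A `False` result does not say whether the cause is the container, a port conflict, or the firewall; running the same check from the NAS itself and from another subnet helps separate those cases.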
My Setup#
| Item | Value |
|---|---|
| NAS | Synology DS1621+ |
| NAS IP | 10.0.10.10 |
| Paperless port | 5656 |
| Docker image | ghcr.io/paperless-ngx/paperless-ngx:latest |
| Documents imported | 358 |
| Import source | Google Takeout (single .zip, ~2 GB) |
| Languages | Chinese + English (OCR configured for both) |
| Training corpus | ~400 labeled docs across 19 types |
Notes#
- The `post_document` endpoint is async: Paperless queues the document for OCR and returns immediately with `{"result": "OK"}`. The document won’t be searchable until OCR completes (usually seconds to a minute per doc, depending on NAS load)
- Paperless deduplicates by a checksum of the file content, so it is safe to re-run the import script; duplicates are skipped rather than imported twice
- Google Takeout exports Google Docs/Sheets as Office formats (.docx, .xlsx) by default; to get PDFs, choose PDF as the export format for Drive when creating the Takeout
- The `document_create_classifier` command prints `No updates since last training` if the training set hasn’t changed since the last run; this is normal, not an error