Paperless-ngx: Migrating a Decade of Documents from Google Drive

Author
Yang Hu

Runbook and design journal for migrating ~400 personal documents from a folder-based Google Drive system into Paperless-ngx on a Synology NAS. Covers taxonomy design, bulk import from Google Takeout, ML classifier setup, and ongoing intake workflow.

Problem Statement

For years my “document management” was a manually maintained folder tree on Google Drive:

10 - 文书材料/
  10 - 证件材料/身份证件/
  30 - 移民文档/
  30 - Tax Filing/
  40 - Finance/
  50 - 车辆注册/
  60 - 住房买房/
  80 - Medical/
20 - 家装住房信息/
80 - 旅行计划/

This worked well enough for filing but poorly for retrieval. Finding “what insurance forms did I have in 2022?” meant navigating six folders and guessing what I named things. Paperless-ngx offers full-text search, OCR, and an ML classifier that learns from your own labeling — a meaningfully better system for a document archive that spans immigration paperwork, tax filings, mortgage docs, and medical records across 10+ years.

Architecture

Google Takeout .zip
  │  migrate.py (classify + upload)
Paperless-ngx (Docker, 10.0.10.10:5656)
  │  REST API  POST /api/documents/post_document/
  │  OCR + full-text index
  │  ML classifier (re-trains on labeled corpus)
Synology NAS storage (/volume1/docker/paperless/)

The NAS runs Paperless-ngx in Docker via Container Manager. All storage (documents, database, Redis) is mounted to /volume1/docker/paperless/.

Taxonomy Design

Getting the taxonomy right before bulk import is important — it’s the training signal for the ML classifier. Wrong labels in 400 documents teach the wrong thing.

Document Types

The goal was types that are mutually exclusive and exhaustive for the documents I actually have. Chinese names throughout — the ML classifier learns from your labels regardless of language.

ID | Name | What goes here
1 | 发票收据 | Invoices, receipts, paid orders (completed transactions only)
3 | 操作手册 | Product manuals, user guides, assembly instructions
4 | 活动通告 | Event invitations, announcements (non-booking)
5 | 参考资料 | Reference material, price lists, brochures
6 | 设备信息 | Equipment specs, device records, warranties
7 | 日程课表 | Recurring schedules, class calendars, timetables
8 | 金融账单 | Bank/investment/brokerage statements
9 | 税务文件 | Tax returns, W-2, 1099, 1098, HSA
10 | 身份证件 | Passports, driver’s license, ID cards
11 | 合同协议 | Contracts, agreements, leases, work authorizations
12 | 医疗记录 | Medical records, prescriptions, lab reports
13 | 证明证书 | Certificates, proof letters, notarizations
14 | 移民文件 | I-797, I-94, I-20, EAD, green card
15 | 签证申请 | Visa applications (US, China, Canada, etc.)
16 | 工资单 | Pay stubs
17 | 房产文件 | Mortgage, permits, property tax, HOA
18 | 车辆文件 | Car leases, DMV, registration
28 | 行程预订 | Hotel/flight booking confirmations, event tickets, passes, permits (SNO-PARK, etc.)
29 | 报价估价 | Repair estimates, construction quotes, project proposals (pre-payment)

Key design decisions:

  • No English names needed — Paperless ML learns from whatever you use
  • 发票收据 = paid transactions only — Hotel confirmations go to 行程预订; repair estimates go to 报价估价; only completed-payment docs go here
  • 行程预订 vs 发票收据 — A hotel confirmation is a booking, not a receipt; the receipt comes when you check out. Tickets and passes live here too
  • 报价估价 vs 发票收据 — Unpaid estimates and quotes are pre-purchase; once paid and invoiced they become 发票收据
  • 设备信息 ≠ 参考资料 — Equipment records (serial numbers, warranties) are structurally different from reference material (price lists, brochures)
  • 活动通告 ≠ 日程课表 — Event notices are one-time announcements; schedules are recurring reference docs
  • 金融账单 merges bank + investment — Both are periodic statements; the correspondent (HSBC vs Vanguard) distinguishes them if needed
  • 税务文件 covers all tax docs — No need to split W-2 vs 1099 vs return at the type level; tags and correspondents carry that nuance
  • 移民文件 vs 签证申请 — Maintained status documents (I-797, EAD) vs active application packages are genuinely different workflows
  • 医疗账单 → 发票收据 + #医疗 tag — Medical bills are receipts; the tag provides the “medical” dimension without a redundant type

Tags

Tags handle cross-cutting dimensions that don’t belong in doc types:

Tag | Purpose
#保险 | Insurance-related
#医疗 | Medical topic
#教育 | Education
#财务 | Finance topic
#房产 | Real estate
#旅行 | Travel
#车辆 | Vehicle
#移民 | Immigration topic
#签证 | Visa topic
#税务 | Tax topic
#待处理 | Inbox / needs review
#重要 | Important, time-sensitive
#归档 | Archived, no action needed

Year tags were considered and rejected. The initial plan put #2016 through #2026 on every document. On reflection, year tags add noise to the ML training signal, and a “Document Year” custom field covers the use case more cleanly (filterable, sortable, not cluttering the tag cloud).

Custom Fields

ID | Name | Type | Usage
1 | Amount | float | Invoice/bill amounts
2 | Bill Period | date | Statement period end date
3 | 到期日期 | date | Expiry date (documents, visas)
4 | Document Year | integer | Tax year, document year for historical docs
5 | 保单/账号 | string | Policy number, account number

Correspondents

Set up for entities that appear frequently enough to be useful as filters:

ID | Name
4 | Total Vision Campbell
5 | IRS
6 | California FTB
7 | Google LLC
8 | USCIS
9 | US Dept of State
10 | Vanguard
11 | HSBC
12 | County of Santa Clara

Pre-Import: Disable ML Matching

Before bulk importing 400 documents, disable Auto/ML matching on all document types, tags, and correspondents. If auto-matching is active during import, Paperless may attempt to re-classify documents using a half-trained model and overwrite your carefully assigned metadata.

# Disable matching on all document types
# (URLs are quoted so the shell does not treat "?" as a glob character)
curl -s "http://10.0.10.10:5656/api/document_types/?page_size=50" \
  -H "Authorization: Token YOUR_TOKEN" | jq -r '.results[].id' | \
while read -r id; do
  curl -s -X PATCH "http://10.0.10.10:5656/api/document_types/$id/" \
    -H "Authorization: Token YOUR_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"matching_algorithm": 0}'
done

# Same for tags and correspondents (/api/tags/ and /api/correspondents/)
# matching_algorithm: 0=None, 1=Any, 2=All, 3=Literal, 4=Regex, 5=Fuzzy, 6=Auto/ML

Re-enable after import with "matching_algorithm": 6 for types, tags, and correspondents where you want ML suggestions.

Exception: #待处理 should stay at 0 (None) permanently. It’s a status tag applied by workflow, not a content category — the ML has no business guessing it.
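
The disable/re-enable cycle, including the #待处理 exception, can be planned before any request is sent. The sketch below is not part of migrate.py: `plan_matching_updates` and the sample tag records are hypothetical, and it only assumes that list endpoints return `id`, `name`, and `matching_algorithm` for each item.

```python
MATCH_NONE = 0  # values from the matching_algorithm list above
MATCH_AUTO = 6


def plan_matching_updates(items, target, pinned=frozenset()):
    """Return (id, payload) pairs for items whose matching_algorithm must change.

    items:  dicts with "id", "name", "matching_algorithm"
    target: algorithm to apply to everything that is not pinned
    pinned: names that must always stay at MATCH_NONE
    """
    plan = []
    for item in items:
        wanted = MATCH_NONE if item["name"] in pinned else target
        if item["matching_algorithm"] != wanted:
            plan.append((item["id"], {"matching_algorithm": wanted}))
    return plan


# Hypothetical tag records, shaped like the "results" of GET /api/tags/
tags = [
    {"id": 7, "name": "待处理", "matching_algorithm": 6},
    {"id": 8, "name": "税务", "matching_algorithm": 0},
    {"id": 9, "name": "医疗", "matching_algorithm": 0},
]

# After import: turn Auto back on everywhere except the inbox tag
after_import = plan_matching_updates(tags, MATCH_AUTO, pinned={"待处理"})
print(after_import)
```

Each pair maps directly onto one PATCH request like the curl call above, so the plan can be printed and reviewed first.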

Migration Script

migrate.py handles classification and upload in one pass. It reads directly from the Google Takeout .zip without extracting everything first.

Classification Logic

Each file is classified by its folder path using a priority-ordered chain of if checks:

import re
from pathlib import Path

# result(), DOCTYPE, TAG and CORR are defined elsewhere in migrate.py
def classify(rel_path: str) -> dict:
    p = rel_path
    filename = Path(p).name
    fl = filename.lower()               # lowercase name for the regex checks below
    year = extract_year(Path(p).parts)  # from any path component
    corr = corr_from_filename(filename) # keyword match on filename

    if "10 - 身份证件" in p:
        return result(DOCTYPE["身份证件"], [TAG["移民"]])

    if "30 - 移民文档" in p:
        return result(DOCTYPE["移民文件"], [TAG["移民"]])

    if "30 - Tax Filing" in p:
        tax_tags = [TAG["税务"]]
        if re.search(r'w[\-\s]?2\b', fl):
            # W-2s default to Google LLC when the filename names no correspondent
            return result(DOCTYPE["税务文件"], tax_tags, corr or CORR["Google LLC"])
        if re.search(r'(f?1099|f?1098|f?5498|1095)', fl):
            return result(DOCTYPE["税务文件"], tax_tags, corr)
        return result(DOCTYPE["税务文件"], tax_tags)

    # ... more rules ...

    return result(None, [TAG["待处理"]])  # fallback: inbox

Year is extracted from any path component matching \b(20[12]\d)\b — so 30 - Tax Filing/2022/W2.pdf gets year=2022 automatically.
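
A minimal sketch of what `extract_year` could look like, built around the regex quoted above. The real helper is not shown in this post, so the first-match-wins behaviour and the `None` return are assumptions.

```python
import re
from typing import Iterable, Optional

YEAR_RE = re.compile(r"\b(20[12]\d)\b")  # matches 2010 through 2029


def extract_year(parts: Iterable[str]) -> Optional[int]:
    """Return the first year found in any path component, or None."""
    for part in parts:
        match = YEAR_RE.search(part)
        if match:
            return int(match.group(1))
    return None


print(extract_year(("10 - 文书材料", "30 - Tax Filing", "2022", "W2.pdf")))
print(extract_year(("40 - Finance", "statement-2019.pdf")))
print(extract_year(("scan_2021_final.pdf",)))
```

The last call finds nothing: `\b` treats the underscore as a word character, so `scan_2021_final.pdf` has no word boundary around the year. The same applies to a year followed directly by a CJK character, which is worth knowing with mixed Chinese/English folder names.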

Correspondent is inferred from filename keywords first (google, hsbc, vanguard, irs, ftb); folder-specific rules then supply a default when no keyword matches (for example, W-2s fall back to Google LLC).
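
The keyword lookup could be as small as the sketch below. The `CORR_KEYWORDS` table is hypothetical; only the keywords and the correspondent IDs come from this post.

```python
from typing import Optional

# Hypothetical keyword table; IDs follow the Correspondents list above
CORR_KEYWORDS = [
    ("google", 7),     # Google LLC
    ("hsbc", 11),      # HSBC
    ("vanguard", 10),  # Vanguard
    ("irs", 5),        # IRS
    ("ftb", 6),        # California FTB
]


def corr_from_filename(filename: str) -> Optional[int]:
    """Return the correspondent ID of the first keyword found, else None."""
    name = filename.lower()
    for keyword, corr_id in CORR_KEYWORDS:
        if keyword in name:
            return corr_id
    return None


print(corr_from_filename("W2-Google.pdf"))
print(corr_from_filename("passport.pdf"))
print(corr_from_filename("first-draft.pdf"))
```

The third call shows the cost of plain substring matching: "first" contains "irs", so the file is attributed to the IRS. Short keywords are safer behind a word-boundary regex, and the dry run is the place to catch cases like this.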

Upload

resp = requests.post(
    f"{API_BASE}/documents/post_document/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    files={"document": (filename, data)},
    data=form_data,   # list of (key, value) tuples for repeated fields
    timeout=120,
)

Tags must be sent as repeated form fields (not a JSON array):

for tag_id in meta["tags"]:
    form_data.append(("tags", tag_id))

Custom fields are sent as a JSON-encoded dict whose keys are the field IDs as strings:

form_data.append(("custom_fields", json.dumps({str(YEAR_FIELD_ID): year})))
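
Putting the pieces together, the whole form payload can be assembled in one place. This is a sketch, not the actual migrate.py code: `build_form_data` is a hypothetical helper, and the tag IDs in the example are made up. The field names (`title`, `document_type`, `correspondent`, `tags`, `custom_fields`) are the ones `post_document` accepts.

```python
import json

YEAR_FIELD_ID = 4  # "Document Year" in the Custom Fields list above


def build_form_data(title, doctype_id, tag_ids, corr_id=None, year=None):
    """Build the list of (key, value) tuples passed as data= to requests.post."""
    form = [("title", title)]
    if doctype_id is not None:
        form.append(("document_type", str(doctype_id)))
    if corr_id is not None:
        form.append(("correspondent", str(corr_id)))
    for tag_id in tag_ids:
        # One ("tags", id) tuple per tag gives repeated form fields
        form.append(("tags", str(tag_id)))
    if year is not None:
        form.append(("custom_fields", json.dumps({str(YEAR_FIELD_ID): year})))
    return form


# Tag IDs 20 and 21 are placeholders for illustration
form_data = build_form_data("W2-Google", 9, [20, 21], corr_id=7, year=2022)
for pair in form_data:
    print(pair)
```

A list of tuples is used instead of a dict because a dict cannot hold the same key twice, which is exactly what repeated `tags` fields need.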

File Type Filtering

Skip non-document files that made it into the Takeout:

SKIP_EXTENSIONS = {
    ".html", ".csv", ".qfx", ".gsheet", ".gdoc", ".gslides",
    ".gdraw", ".gmap", ".java", ".bin", ".db", ".exe", ".7z",
    ".rar", ".gshortcut", ".pptx", ".xls", ".xlsx", ".tar",
    ".doc", ".docx",  # Paperless can't OCR these reliably
}

Paperless handles .doc/.docx only when the optional Tika and Gotenberg services are enabled, and the results are unreliable for full-text search; export to PDF first if you care about searching those.
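
Reading straight from the archive and applying the extension filter might look like the sketch below. `iter_importable` is a hypothetical name, the `Takeout/Drive/` prefix is an assumption about the archive layout, and the extension set is abbreviated. The demo builds a tiny in-memory zip so it runs without a real Takeout file.

```python
import io
import zipfile
from pathlib import PurePosixPath

SKIP_EXTENSIONS = {".html", ".csv", ".gdoc", ".docx"}  # abbreviated for the demo


def iter_importable(zf: zipfile.ZipFile):
    """Yield ZipInfo entries that are files and not on the skip list."""
    for info in zf.infolist():
        if info.is_dir():
            continue
        # Zip member names always use "/" separators; compare suffixes in lowercase
        if PurePosixPath(info.filename).suffix.lower() in SKIP_EXTENSIONS:
            continue
        yield info


# Build a small archive in memory to stand in for the Takeout .zip
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Takeout/Drive/30 - Tax Filing/2022/W2.PDF", b"%PDF-1.4")
    zf.writestr("Takeout/Drive/index.HTML", b"<html></html>")
    zf.writestr("Takeout/Drive/40 - Finance/export.csv", b"a,b")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    kept = [info.filename for info in iter_importable(zf)]
print(kept)
```

With a real archive, `zf.read(info)` returns the bytes for one member at a time, so nothing has to be extracted to disk before upload.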

Dry Run

python3 migrate.py --dry-run

Output:

=================================================================
  IMPORT PREVIEW — 358 files to import
=================================================================

By document type:
  税务文件        89 files
  移民文件        72 files
  金融账单        61 files
  ...

Skipped (non-document files): 47
  ext:.html                       18
  ext:.csv                        12
  ...

Sample mappings (first 30):
  10 - 文书材料/30 - Tax Filing/2022/W2-Google.pdf
    → 税务文件 | tags:['税务'] | corr:Google LLC | year:2022
    title: W2-Google
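
The per-type counts in a preview like this are a natural fit for `collections.Counter`. A minimal sketch, assuming each classification result carries a `doctype` name (or `None` for the inbox fallback); `summarize` is a hypothetical helper and the exact layout of the real output is not reproduced.

```python
from collections import Counter


def summarize(results):
    """Return preview lines with one count per document type, largest first."""
    by_type = Counter(r["doctype"] or "(inbox)" for r in results)
    lines = [f"IMPORT PREVIEW: {len(results)} files to import",
             "By document type:"]
    for name, count in by_type.most_common():
        lines.append(f"  {name}  {count} files")
    return lines


sample = [{"doctype": "税务文件"}, {"doctype": "税务文件"}, {"doctype": None}]
print("\n".join(summarize(sample)))
```

A large "(inbox)" count in the dry run is the signal that a folder rule is missing from `classify`.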

Execute

python3 migrate.py --execute --yes

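The --yes flag skips the confirmation prompt, which is necessary when the script runs without an interactive terminal (for example, a non-interactive SSH command), where input() would hang or fail. Paperless rejects files whose content checksum matches an existing document, so re-running is safe.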
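
A sketch of how the three flags could be wired with `argparse`. The flag names come from the commands above; `parse_args`, `confirmed`, and the prompt text are assumptions. Checking `isatty()` lets the script refuse cleanly instead of blocking when there is no terminal to answer the prompt.

```python
import argparse
import io
import sys


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Import a Takeout archive into Paperless-ngx")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--dry-run", action="store_true", help="print the preview only")
    mode.add_argument("--execute", action="store_true", help="upload documents")
    parser.add_argument("--yes", action="store_true", help="skip the confirmation prompt")
    return parser.parse_args(argv)


def confirmed(args, stdin=sys.stdin):
    """Return True if the upload may proceed."""
    if args.yes:
        return True
    if not stdin.isatty():
        # No terminal attached: refuse instead of waiting on input()
        return False
    return input("Upload to Paperless? [y/N] ").strip().lower() == "y"


print(confirmed(parse_args(["--execute", "--yes"])))
print(confirmed(parse_args(["--execute"]), stdin=io.StringIO("")))
```

The second call simulates a session without a terminal and returns `False` without prompting.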

Results

358 documents imported across 17 document types (19 after taxonomy expansion; see below). Upload took ~25 minutes including OCR processing time. Zero errors after fixing the .doc/.docx exclusions.

ML Classifier Training

After all documents were OCR-processed and the #待处理 tag bulk-removed:

# Run from NAS with docker permissions
/usr/local/bin/docker exec paperless-webserver-1 \
  python3 manage.py document_create_classifier

With ~400 labeled documents, the classifier has a solid training set for the most common document types (税务文件, 移民文件, 金融账单 each have 50-90 examples). Rarer types like 操作手册 or 活动通告 will improve as more documents are added.

Wait until OCR is complete before training. The classifier trains on the OCR’d content, not the raw file. Running too early means training on empty or partial text.
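
One way to check that the queue has drained before training is to look at the task list. The sketch below only does the counting: it assumes task records shaped like the `/api/tasks/` response, with a `status` field holding Celery-style states (`PENDING`, `STARTED`, `SUCCESS`, `FAILURE`). Treat both the field name and the state strings as assumptions to verify against your Paperless version; `ocr_progress` is a hypothetical helper.

```python
from collections import Counter

FINISHED = {"SUCCESS", "FAILURE"}


def ocr_progress(tasks):
    """Return (counts per status, True if no task is still queued or running)."""
    counts = Counter(task["status"] for task in tasks)
    done = all(task["status"] in FINISHED for task in tasks)
    return counts, done


# Hypothetical task records for illustration
tasks = [
    {"task_file_name": "W2-Google.pdf", "status": "SUCCESS"},
    {"task_file_name": "I-797.pdf", "status": "STARTED"},
    {"task_file_name": "lease.pdf", "status": "PENDING"},
]
counts, done = ocr_progress(tasks)
print(dict(counts), done)
```

Only run `document_create_classifier` once `done` is true, and review any `FAILURE` entries first, since those documents never reached the training corpus.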

Inbox Workflow

New documents (uploaded manually, scanned via mobile app, or imported via email) automatically get tagged #待处理:

Settings → Workflows → Add Workflow:

  • Name: New Document Inbox
  • Trigger: Document Added (type 2)
  • Action: Assign Tag #待处理 (action type 1)

This gives you a reliable inbox view without relying on ML guessing. The workflow fires on every new document regardless of source.

#待处理 has matching_algorithm: 0 (None) — it is never assigned by ML, only by workflow. This keeps it clean as a status signal.

Clear the inbox by removing #待处理 after reviewing a document.

Ongoing Usage Patterns

Intake sources

Source | Method
Paper documents | Scan with mobile app (e.g. Scanner Pro), upload to Paperless
Email attachments | Paperless email inbox (IMAP polling configured separately)
Downloaded PDFs | Drag-drop to Paperless UI or consume folder
Consume folder | SMB share from NAS, accessible from Windows/Mac

Review workflow

  1. Open saved search: tag:#待处理
  2. For each document: verify type, add any missing tags, fix title if needed
  3. Remove #待处理 — document moves out of inbox
  4. Add #重要 for anything time-sensitive or expiring soon

Finding documents

  • Full-text search handles most cases (e.g. “EAD renewal 2023”)
  • Filter by document type + correspondent for statements (金融账单 + Vanguard)
  • Filter by correspondent + Document Year custom field for tax docs
  • Tag #归档 on anything fully processed and unlikely to need action

Credential backup

Store the Paperless admin password and API token in your password manager (Bitwarden). The API token lives in Settings → Tokens; if you rotate it, regenerate it there and update any scripts that use it.

Debugging Notes

待处理 appearing on all imported docs

If you ran the ML classifier before disabling Auto/ML on #待处理, it may have learned to tag everything. Bulk-remove via API:

import requests

API_BASE  = "http://10.0.10.10:5656/api"
API_TOKEN = "YOUR_TOKEN"
TAG_ID    = 7  # #待处理

headers = {"Authorization": f"Token {API_TOKEN}"}

# Get all docs with this tag (a single page of 500 covers this archive)
r = requests.get(f"{API_BASE}/documents/",
                 params={"tags__id__all": TAG_ID, "page_size": 500},
                 headers=headers, timeout=30)
r.raise_for_status()  # stop here on auth or URL errors
doc_ids = [d["id"] for d in r.json()["results"]]
print(f"Total docs to clean: {len(doc_ids)}")

# Bulk remove tag (json= sets the Content-Type header automatically)
r = requests.post(f"{API_BASE}/documents/bulk_edit/",
    headers=headers,
    json={"documents": doc_ids,
          "method": "remove_tag",
          "parameters": {"tag": TAG_ID}},
    timeout=120)
print(r.status_code, r.text)
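
For archives larger than one page, the ID list has to be collected across pages. The sketch below follows the `next` link that paginated list responses carry. `collect_ids` is a hypothetical helper, and the page fetcher is passed in as a function so the loop can be tried offline with fake pages.

```python
from typing import Callable, List, Optional


def collect_ids(fetch_page: Callable[[str], dict], first_url: str) -> List[int]:
    """Follow "next" links until exhausted and return every document ID."""
    ids: List[int] = []
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        ids.extend(doc["id"] for doc in page["results"])
        url = page.get("next")  # None on the last page
    return ids


# Offline demo with two fake pages keyed by made-up URLs
fake_pages = {
    "page1": {"results": [{"id": 1}, {"id": 2}], "next": "page2"},
    "page2": {"results": [{"id": 3}], "next": None},
}
all_ids = collect_ids(fake_pages.__getitem__, "page1")
print(all_ids)
```

Against the real API, the fetcher would be something like `lambda url: requests.get(url, headers=headers, timeout=30).json()`, starting from the filtered `/documents/` URL used above.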

Document type auto-assigned incorrectly after import

ML may assign wrong types if it trained on an imbalanced or noisy corpus. Check the document, correct manually, then retrain:

/usr/local/bin/docker exec paperless-webserver-1 \
  python3 manage.py document_create_classifier

Each manual correction feeds back into the training set.

custom_fields format error

Paperless expects custom fields as a JSON-encoded dict with string keys:

# ✓ Correct
json.dumps({"4": 2024})

# ✗ Wrong — causes 400 error
json.dumps([{"field": 4, "value": 2024}])

Port conflicts

If Paperless is unreachable, check that no other service is using port 5656 on the NAS. DSM firewall may also block it from certain subnets — check Control Panel → Security → Firewall.

My Setup

Item | Value
NAS | Synology DS1621+
NAS IP | 10.0.10.10
Paperless port | 5656
Docker image | ghcr.io/paperless-ngx/paperless-ngx:latest
Documents imported | 358
Import source | Google Takeout (single .zip, ~2 GB)
Languages | Chinese + English (OCR configured for both)
Training corpus | ~400 labeled docs across 19 types

Notes

  • The post_document endpoint is async: Paperless queues the document for consumption and returns HTTP 200 immediately, with the UUID of the consumption task as the response body. The document won’t be searchable until OCR completes (usually seconds to a minute per doc, depending on NAS load)
  • Paperless deduplicates by a checksum of the file content, so re-running the import script is safe; duplicates are rejected at consumption time rather than imported twice
  • Google Takeout exports Google Docs/Sheets as Microsoft Office formats (.docx, .xlsx) by default; to get PDFs, use the “Export format: PDF” option when creating the Takeout
  • The document_create_classifier command prints No updates since last training if the training set hasn’t changed since the last run — this is normal, not an error