Standards Statement: Non-Proliferation of Information

To ensure efficiency, compliance, and trust in the Data Warehouse, we commit to the principle of non-proliferation of information. This means that data will be held and processed in a way that avoids unnecessary duplication, ensures clear lineage, and retains only what is required for business and regulatory purposes.

Standards

  • Single Source of Truth – Data will have one authoritative source. Derived copies (e.g. marts, extracts) must be traceable to this source.
  • Access Control – All data will be protected by role-based access, with sensitive attributes masked or restricted.
  • Retention & Archiving – Daily landings are retained for 12 months to support audit and recovery. Beyond this, archives are compacted to monthly snapshots and retained for up to 5 years, after which data will be securely deleted or moved to regulated long-term storage.
  • Change Tracking – Slowly Changing Dimensions (SCDs) and fact histories will be built from controlled deltas or retained snapshots, not uncontrolled copies.
  • Auditability – All movement and transformation of data will be logged, with lineage made visible through metadata.

This approach ensures the Data Warehouse balances business value, regulatory compliance, and cost-efficient stewardship of information, without uncontrolled growth of redundant data.


1. Standards to Set for Non-Proliferation of Information

We want to align with both data governance and data engineering best practice. Key standards include:

🔒 Security & Access

  • Least privilege: Users and systems only see the data they need.
  • Role-based access control (RBAC) at schema, container, or even column level (for sensitive attributes).
  • PII & sensitive data masking: Mask or hash personal identifiers where not required for analytics.
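As an illustration of the masking standard above (a minimal Python sketch; the key name and record structure are made up, and in practice the secret would come from a key vault, not source code), personal identifiers can be replaced with a keyed hash so records stay joinable without exposing the raw value:

```python
import hmac
import hashlib

# Hypothetical secret for illustration only; store real keys in a vault.
MASKING_KEY = b"example-secret-key"

def mask_identifier(value: str) -> str:
    """Replace a personal identifier with a keyed hash.

    The same input always yields the same token, so masked columns
    can still be joined across tables without revealing the value.
    """
    return hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"customer_id": "CUST-1042", "email": "jane@example.com", "balance": 120.50}
masked = {
    "customer_id": mask_identifier(record["customer_id"]),
    "email": mask_identifier(record["email"]),
    "balance": record["balance"],  # non-sensitive attributes pass through unchanged
}
```

A keyed hash (HMAC) rather than a plain hash is deliberate: without the key, identifiers cannot be recovered by brute-forcing common values.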

📦 Storage & Duplication

  • Single source of truth (SSOT) principle: Each dataset should have exactly one authoritative source (the bronze/raw layer serves this role); avoid proliferating uncontrolled “copies” in spreadsheets, shadow databases, etc.

  • Metadata-driven pipelines: Ensure all copies are traceable and explainable.
  • Versioning not replication: Prefer metadata snapshots, audit tables, or change-tracking logs instead of storing uncontrolled duplicates.
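To illustrate the “versioning not replication” standard (a minimal Python sketch with made-up record structures, not a production design), a change-tracking log records only what changed between two loads, rather than keeping a second full copy of the data:

```python
def diff_snapshots(previous: dict, current: dict) -> list:
    """Compare two keyed snapshots and emit a change log.

    Each entry records the key, the operation, and (for inserts/updates)
    the new row -- enough to replay history without storing full copies.
    """
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append({"key": key, "op": "insert", "row": row})
        elif previous[key] != row:
            changes.append({"key": key, "op": "update", "row": row})
    for key in previous:
        if key not in current:
            changes.append({"key": key, "op": "delete"})
    return changes

yesterday = {1: {"status": "open"}, 2: {"status": "open"}}
today = {1: {"status": "closed"}, 3: {"status": "open"}}
log = diff_snapshots(yesterday, today)
# log holds one update (key 1), one insert (key 3), one delete (key 2)
```

Replaying such a log against a checkpoint reconstructs any historical state, which is the same capability a full-copy archive provides at a fraction of the storage.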

🧹 Lifecycle Management

  • Retention & archiving policy: Define how long we keep raw/archived data. For regulatory compliance this could be 5, 7, or 10 years depending on the domain.
  • Deletion policy: Make clear when and how data is deleted (soft vs hard delete, GDPR/FOIA considerations).
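The soft- vs hard-delete distinction above can be sketched as follows (Python, with a hypothetical `deleted_at` convention; the actual mechanism would depend on our platform):

```python
from datetime import datetime, timezone

def soft_delete(row: dict) -> dict:
    """Mark a row as deleted but retain it, preserving the audit trail."""
    return {**row, "deleted_at": datetime.now(timezone.utc).isoformat()}

def hard_delete(table: dict, key) -> None:
    """Physically remove the row, e.g. for a GDPR erasure request."""
    table.pop(key, None)

table = {1: {"name": "alice"}, 2: {"name": "bob"}}
table[1] = soft_delete(table[1])   # recoverable; still visible to audit queries
hard_delete(table, 2)              # irrecoverable; satisfies right-to-erasure
```

Soft deletes keep history intact for audit and SCD rebuilds; hard deletes are reserved for cases where regulation requires the data to be genuinely gone.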

📑 Auditability & Transparency

  • Lineage tracking: Be able to show where data came from and where it goes.
  • Logging & monitoring: Ensure movement between containers/layers is recorded.

2. Archiving Daily Landings (365 days × 5 years ≈ 1,825 files/partitions)

We are effectively describing a daily snapshot archive strategy. That’s common for supporting slowly changing dimensions (SCDs) and fact table reconciliation. Some considerations:

👍 Benefits

  • Supports point-in-time recovery if we need to rebuild historical SCDs or facts.
  • Creates an immutable audit trail (can prove what the source looked like on any given day).

⚠️ Concerns

  • Storage bloat: 5 years × 365 days could be large, depending on file size and granularity.
  • Performance: Rebuilding SCDs from thousands of partitions can be heavy unless partitioning/indexing is well designed.
  • Non-proliferation risk: If each daily landing is a full copy of the dataset, we are storing huge redundant copies rather than just the changes.

✅ Mitigations / Best Practices

  • Delta storage model: Instead of full daily snapshots, store deltas (inserts/updates/deletes) alongside a checkpoint. This is more “non-proliferation friendly”.
  • Partitioning: Partition by date so queries only touch what’s needed.
  • Compaction: Periodically compact older daily partitions into monthly/yearly snapshots.
  • Lifecycle policies: For example:
    • Keep daily snapshots for 1 year
    • Compact to weekly/monthly snapshots for years 2–5
    • Beyond 5 years, keep only compliance-critical extracts
  • Cold storage tiering: Move old snapshots into cheaper storage tiers (Azure Archive/Glacier equivalent).
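The lifecycle policy above could be expressed as a simple tiering rule. The sketch below (Python, using the example thresholds from this section; in practice the policy would live in platform configuration such as storage lifecycle rules, not application code) maps a partition's age to its retention action:

```python
def retention_tier(age_in_days: int) -> str:
    """Map a partition's age to its retention action.

    Thresholds mirror the example policy: daily snapshots for year 1,
    compacted monthly snapshots for years 2-5, then deletion or a
    compliance-critical extract only.
    """
    if age_in_days <= 365:
        return "keep-daily"
    if age_in_days <= 5 * 365:
        return "compact-to-monthly"
    return "delete-or-compliance-extract"

for age in (30, 400, 2000):
    print(age, retention_tier(age))
```

Keeping the thresholds in one declarative rule makes the policy auditable and easy to adjust when regulatory retention periods change.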

3. Suggested Position

We should be able to say:

  • Yes, we need an archive to support SCDs/facts.
  • But we should set a standard for non-proliferation so we don’t just hoard duplicates:
    • Use delta storage or compaction to reduce volume.
    • Apply retention rules (e.g. 1 year daily, 4 years monthly).
    • Ensure archived data is access-controlled and traceable.

That balances regulatory needs, technical needs, and governance principles.

Leave a Comment