To ensure efficiency, compliance, and trust in the Data Warehouse, we commit to the principle of non-proliferation of information. This means that data will be held and processed in a way that avoids unnecessary duplication, ensures clear lineage, and retains only what is required for business and regulatory purposes.
Standards
- Single Source of Truth – Data will have one authoritative source. Derived copies (e.g. marts, extracts) must be traceable to this source.
- Access Control – All data will be protected by role-based access, with sensitive attributes masked or restricted.
- Retention & Archiving – Daily landings are retained for 12 months to support audit and recovery. Beyond this, archives are compacted to monthly snapshots and retained until the data is 5 years old (1 year daily, then 4 years monthly), after which data will be securely deleted or moved to regulated long-term storage.
- Change Tracking – Slowly Changing Dimensions (SCDs) and fact histories will be built from controlled deltas or retained snapshots, not uncontrolled copies.
- Auditability – All movement and transformation of data will be logged, with lineage made visible through metadata.
This approach ensures the Data Warehouse balances business value, regulatory compliance, and cost-efficient stewardship of information, without uncontrolled growth of redundant data.
1. Standards to Set for Non-Proliferation of Information
We want to align with both data governance and data engineering best practice. Key standards include:
🔒 Security & Access
- Least privilege: Users and systems only see the data they need.
- Role-based access control (RBAC) at schema, container, or even column level (for sensitive attributes).
- PII & sensitive data masking: Mask or hash personal identifiers where not required for analytics.
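As an illustration of the masking point above, here is a minimal Python sketch. It assumes a keyed hash (HMAC-SHA-256) is acceptable for pseudonymising identifiers that are still needed as join keys, and simple character masking for display. The names `SECRET_KEY`, `pseudonymise` and `mask_email` are hypothetical, and the key would need to come from a managed secrets store rather than source code.

```python
import hashlib
import hmac

# Hypothetical placeholder: a real key would be loaded from a secrets store.
SECRET_KEY = b"replace-with-managed-secret"


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of a personal identifier.

    The value is trimmed and lower-cased first so that the same person
    maps to the same token regardless of formatting differences.
    """
    normalised = value.strip().lower()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain; mask the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        # Not a recognisable email address: mask everything.
        return "*" * len(email)
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain
```

The pseudonymised token stays stable across loads, so analysts can still count or join on it without seeing the underlying identifier. Whether a keyed hash is sufficient for a given attribute is a governance decision, not something this sketch settles.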
📦 Storage & Duplication
- Single source of truth (SSOT) principle: Data should only have one authoritative source (our bronze/raw layer is fine, but avoid proliferating uncontrolled “copies” in spreadsheets, shadow databases, etc.).
- Metadata-driven pipelines: Ensure all copies are traceable and explainable.
- Versioning not replication: Prefer metadata snapshots, audit tables, or change-tracking logs instead of storing uncontrolled duplicates.
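The "versioning not replication" idea can be sketched as a change log derived from two consecutive snapshots, so that only the differences need to be stored. This is a simplified in-memory illustration: the `id` key field and the function names `compute_delta` and `apply_delta` are assumptions, and a real pipeline would do this in the warehouse or lake engine rather than in Python dictionaries.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Snapshot = Dict[str, Row]  # rows keyed by their business key


def compute_delta(previous: Snapshot, current: Snapshot) -> Dict[str, List[Any]]:
    """Compare two snapshots and return inserts, updates and deletes."""
    inserts = [current[k] for k in current if k not in previous]
    updates = [
        current[k] for k in current if k in previous and current[k] != previous[k]
    ]
    deletes = [k for k in previous if k not in current]
    return {"inserts": inserts, "updates": updates, "deletes": deletes}


def apply_delta(
    previous: Snapshot, delta: Dict[str, List[Any]], key_field: str = "id"
) -> Snapshot:
    """Rebuild the later snapshot from the earlier one plus its delta."""
    deleted = set(delta["deletes"])
    result = {k: dict(v) for k, v in previous.items() if k not in deleted}
    for row in delta["inserts"] + delta["updates"]:
        result[row[key_field]] = dict(row)
    return result


day_1 = {
    "c1": {"id": "c1", "city": "Leeds"},
    "c2": {"id": "c2", "city": "York"},
}
day_2 = {
    "c1": {"id": "c1", "city": "Hull"},   # changed
    "c3": {"id": "c3", "city": "Derby"},  # new; c2 was removed
}
delta = compute_delta(day_1, day_2)
rebuilt = apply_delta(day_1, delta)  # equals day_2
```

Because the later state can be rebuilt from a checkpoint plus deltas, history is preserved without holding a full duplicate for every day.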
🧹 Lifecycle Management
- Retention & archiving policy: Define how long we keep raw/archived data. For regulatory compliance this could be 5, 7, or 10 years depending on domain.
- Deletion policy: Make clear when and how data is deleted (soft vs hard delete, GDPR/FOIA considerations).
📑 Auditability & Transparency
- Lineage tracking: Be able to show where data came from and where it goes.
- Logging & monitoring: Ensure movement between containers/layers is recorded.
2. Archiving Daily Landings (365 × 5 years ≈ 1,825 files/partitions)
We are effectively describing a daily snapshot archive strategy. That’s common for supporting slowly changing dimensions (SCDs) and fact table reconciliation. Some considerations:
👍 Benefits
- Supports point-in-time recovery if we need to rebuild historical SCDs or facts.
- Creates an immutable audit trail (can prove what the source looked like on any given day).
⚠️ Concerns
- Storage bloat: 5 years × 365 days is roughly 1,825 daily landings; total volume depends on file size and granularity.
- Performance: Rebuilding SCDs from thousands of partitions can be heavy unless partitioning/indexing is well designed.
- Non-proliferation risk: If each landing is a full copy of the dataset, we store large redundant copies instead of just the changes.
✅ Mitigations / Best Practices
- Delta storage model: Instead of full daily snapshots, store deltas (inserts/updates/deletes) alongside a checkpoint. This is more “non-proliferation friendly”.
- Partitioning: Partition by date so queries only touch what’s needed.
- Compaction: Periodically compact older daily partitions into monthly/yearly snapshots.
- Lifecycle policies: For example:
  - Keep daily snapshots for 1 year
  - Compact to weekly/monthly snapshots for years 2–5
  - Beyond 5 years, keep only compliance-critical extracts.
- Cold storage tiering: Move old snapshots into cheaper storage tiers (e.g. the Azure Blob Storage archive tier, or Amazon S3 Glacier on AWS).
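The example lifecycle policy above could be expressed as a small rule that a scheduled job evaluates for each snapshot. The following is a sketch only: the thresholds mirror the example figures (1 year daily, up to 5 years compacted), while the action names, the `retention_action` and `monthly_bucket` helpers, and the use of 365-day years are assumptions for illustration.

```python
from datetime import date

DAILY_RETENTION_DAYS = 365      # keep daily snapshots for 1 year
TOTAL_RETENTION_DAYS = 5 * 365  # compacted snapshots kept up to 5 years


def retention_action(snapshot_date: date, today: date) -> str:
    """Return the lifecycle action for a snapshot based on its age."""
    age_days = (today - snapshot_date).days
    if age_days < 0:
        raise ValueError("snapshot_date is in the future")
    if age_days <= DAILY_RETENTION_DAYS:
        return "keep_daily"
    if age_days <= TOTAL_RETENTION_DAYS:
        return "compact_to_monthly"
    return "compliance_extract_only"


def monthly_bucket(snapshot_date: date) -> str:
    """Name of the monthly snapshot a daily partition is compacted into."""
    return f"{snapshot_date.year:04d}-{snapshot_date.month:02d}"
```

For instance, evaluated on 30 June 2025, a snapshot dated 15 January 2025 would be kept as a daily partition, while one dated 30 June 2023 would be compacted into the `2023-06` monthly snapshot.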
3. Suggested Position
We should be able to say:
- Yes, we need an archive to support SCDs/facts.
- But we should set a standard for non-proliferation so we don’t just hoard duplicates:
  - Use delta storage or compaction to reduce volume.
  - Apply retention rules (e.g. 1 year daily, 4 years monthly).
  - Ensure archived data is access-controlled and traceable.
That balances regulatory needs, technical needs, and governance principles.
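To put a rough number on the suggested retention rule, the short calculation below compares the partition count for five years of daily landings with the count under "1 year daily, 4 years monthly". It assumes 365-day years and one partition per retained snapshot. It counts partitions only, not bytes, since a monthly snapshot may well be larger than a single daily landing.

```python
DAYS_PER_YEAR = 365
MONTHS_PER_YEAR = 12


def partition_counts(total_years: int = 5, daily_years: int = 1) -> tuple:
    """Return (all-daily count, tiered count) for the retention window."""
    all_daily = total_years * DAYS_PER_YEAR
    tiered = (
        daily_years * DAYS_PER_YEAR
        + (total_years - daily_years) * MONTHS_PER_YEAR
    )
    return all_daily, tiered


all_daily, tiered = partition_counts()
print(all_daily, tiered)  # prints: 1825 413
```

Under these assumptions the tiered policy holds 413 partitions instead of 1,825, a reduction of roughly 77% in the number of objects to store, secure and trace.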