Standards Statement: Non-Proliferation of Information

To ensure efficiency, compliance, and trust in the Data Warehouse, we commit to the principle of non-proliferation of information. This means that data will be held and processed in a way that avoids unnecessary duplication, ensures clear lineage, and retains only what is required for business and regulatory purposes.

Standards

  • Single Source of Truth – Data will have one authoritative source. Derived copies (e.g. marts, extracts) must be traceable to this source.
  • Access Control – All data will be protected by role-based access, with sensitive attributes masked or restricted.
  • Retention & Archiving – Daily landings are retained for 12 months to support audit and recovery. Beyond this, archives are compacted to monthly snapshots and retained for up to 5 years, after which data will be securely deleted or moved to regulated long-term storage.
  • Change Tracking – Slowly Changing Dimensions (SCDs) and fact histories will be built from controlled deltas or retained snapshots, not uncontrolled copies.
  • Auditability – All movement and transformation of data will be logged, with lineage made visible through metadata.

This approach ensures the Data Warehouse balances business value, regulatory compliance, and cost-efficient stewardship of information, without uncontrolled growth of redundant data.


1. Standards to Set for Non-Proliferation of Information

We want to align with both data governance and data engineering best practice. Key standards include:

🔒 Security & Access

  • Least privilege: Users and systems only see the data they need.
  • Role-based access control (RBAC) at schema, container, or even column level (for sensitive attributes).
  • PII & sensitive data masking: Mask or hash personal identifiers where not required for analytics.
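As an illustration of the masking standard above (a minimal Python sketch; the key name and record structure are made up, and in practice the secret would come from a key vault, not source code), personal identifiers can be replaced with a keyed hash so records stay joinable without exposing the raw value:

```python
import hmac
import hashlib

# Hypothetical secret for illustration only; store real keys in a vault.
MASKING_KEY = b"example-secret-key"

def mask_identifier(value: str) -> str:
    """Replace a personal identifier with a keyed hash.

    The same input always yields the same token, so masked columns
    can still be joined across tables without revealing the value.
    """
    return hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"customer_id": "CUST-1042", "email": "jane@example.com", "balance": 120.50}
masked = {
    "customer_id": mask_identifier(record["customer_id"]),
    "email": mask_identifier(record["email"]),
    "balance": record["balance"],  # non-sensitive attributes pass through unchanged
}
```

A keyed hash (HMAC) rather than a plain hash is deliberate: without the key, identifiers cannot be recovered by brute-forcing common values.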

📦 Storage & Duplication

  • Single source of truth (SSOT) principle: Each dataset should have exactly one authoritative source (the bronze/raw layer serves this role); avoid proliferating uncontrolled “copies” in spreadsheets, shadow databases, etc.

  • Metadata-driven pipelines: Ensure all copies are traceable and explainable.
  • Versioning not replication: Prefer metadata snapshots, audit tables, or change-tracking logs instead of storing uncontrolled duplicates.
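To illustrate the “versioning not replication” standard (a minimal Python sketch with made-up record structures, not a production design), a change-tracking log records only what changed between two loads, rather than keeping a second full copy of the data:

```python
def diff_snapshots(previous: dict, current: dict) -> list:
    """Compare two keyed snapshots and emit a change log.

    Each entry records the key, the operation, and (for inserts/updates)
    the new row -- enough to replay history without storing full copies.
    """
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append({"key": key, "op": "insert", "row": row})
        elif previous[key] != row:
            changes.append({"key": key, "op": "update", "row": row})
    for key in previous:
        if key not in current:
            changes.append({"key": key, "op": "delete"})
    return changes

yesterday = {1: {"status": "open"}, 2: {"status": "open"}}
today = {1: {"status": "closed"}, 3: {"status": "open"}}
log = diff_snapshots(yesterday, today)
# log holds one update (key 1), one insert (key 3), one delete (key 2)
```

Replaying such a log against a checkpoint reconstructs any historical state, which is the same capability a full-copy archive provides at a fraction of the storage.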

🧹 Lifecycle Management

  • Retention & archiving policy: Define how long we keep raw/archived data. For regulatory compliance this could be 5, 7, or 10 years depending on the domain.
  • Deletion policy: Make clear when and how data is deleted (soft vs hard delete, GDPR/FOIA considerations).
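The soft- vs hard-delete distinction above can be sketched as follows (Python, with a hypothetical `deleted_at` convention; the actual mechanism would depend on our platform):

```python
from datetime import datetime, timezone

def soft_delete(row: dict) -> dict:
    """Mark a row as deleted but retain it, preserving the audit trail."""
    return {**row, "deleted_at": datetime.now(timezone.utc).isoformat()}

def hard_delete(table: dict, key) -> None:
    """Physically remove the row, e.g. for a GDPR erasure request."""
    table.pop(key, None)

table = {1: {"name": "alice"}, 2: {"name": "bob"}}
table[1] = soft_delete(table[1])   # recoverable; still visible to audit queries
hard_delete(table, 2)              # irrecoverable; satisfies right-to-erasure
```

Soft deletes keep history intact for audit and SCD rebuilds; hard deletes are reserved for cases where regulation requires the data to be genuinely gone.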

📑 Auditability & Transparency

  • Lineage tracking: Be able to show where data came from and where it goes.
  • Logging & monitoring: Ensure movement between containers/layers is recorded.

2. Archiving Daily Landings (365 days × 5 years ≈ 1,825 files/partitions)

We are effectively describing a daily snapshot archive strategy. That’s common for supporting slowly changing dimensions (SCDs) and fact table reconciliation. Some considerations:

👍 Benefits

  • Supports point-in-time recovery if we need to rebuild historical SCDs or facts.
  • Creates an immutable audit trail (can prove what the source looked like on any given day).

⚠️ Concerns

  • Storage bloat: 5 years × 365 days could be large, depending on file size and granularity.
  • Performance: Rebuilding SCDs from thousands of partitions can be heavy unless partitioning/indexing is well designed.
  • Non-proliferation risk: If each daily landing is a full copy of the dataset, we are storing huge redundant copies rather than just the changes.

✅ Mitigations / Best Practices

  • Delta storage model: Instead of full daily snapshots, store deltas (inserts/updates/deletes) alongside a checkpoint. This is more “non-proliferation friendly”.
  • Partitioning: Partition by date so queries only touch what’s needed.
  • Compaction: Periodically compact older daily partitions into monthly/yearly snapshots.
  • Lifecycle policies: For example:
    • Keep daily snapshots for 1 year
    • Compact to weekly/monthly snapshots for years 2–5
    • Beyond 5 years, keep only compliance-critical extracts
  • Cold storage tiering: Move old snapshots into cheaper storage tiers (Azure Archive/Glacier equivalent).
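The lifecycle policy above could be expressed as a simple tiering rule. The sketch below (Python, using the example thresholds from this section; in practice the policy would live in platform configuration such as storage lifecycle rules, not application code) maps a partition's age to its retention action:

```python
def retention_tier(age_in_days: int) -> str:
    """Map a partition's age to its retention action.

    Thresholds mirror the example policy: daily snapshots for year 1,
    compacted monthly snapshots for years 2-5, then deletion or a
    compliance-critical extract only.
    """
    if age_in_days <= 365:
        return "keep-daily"
    if age_in_days <= 5 * 365:
        return "compact-to-monthly"
    return "delete-or-compliance-extract"

for age in (30, 400, 2000):
    print(age, retention_tier(age))
```

Keeping the thresholds in one declarative rule makes the policy auditable and easy to adjust when regulatory retention periods change.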

3. Suggested Position

We should be able to say:

  • Yes, we need an archive to support SCDs/facts.
  • But we should set a standard for non-proliferation so we don’t just hoard duplicates:
    • Use delta storage or compaction to reduce volume.
    • Apply retention rules (e.g. 1 year daily, 4 years monthly).
    • Ensure archived data is access-controlled and traceable.

That balances regulatory needs, technical needs, and governance principles.

Leave a Comment