Data Engineering Session 1
kd = key decisions
- AD users → Graph API
- how to connect to the data source
- connecting to different file formats (CSV, Excel, TXT)
- first step: have a common ingestion framework (CIF) → feeds whatever platform you are using (Databricks in this case)
- identify patterns and turn them into a framework
- can provide custom validation json to make
- cover as many pain points as possible
- sell this as a product internally
- save a lot of value
- Data validation strategy
- we won’t have enough data to test
- to test with a million, 2 million records
- how to maintain data quality when moving to the new system
- whenever Databricks objects are deployed in prod, we compare gold layer objects with the previous systems and map each row
- row count, row-to-row, last 3 months comparison
- will need compute but will be beneficial
- when data pipelines are built → SIT runs → generates report
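
Rough sketch of the row count + row-to-row comparison noted above, in plain Python so it runs anywhere. Assumptions, not from the session: both sides are already extracted as lists of dicts, already filtered to the comparison window (e.g. last 3 months), and share a unique key column. The names `compare_tables`, `legacy`, `gold` and `id` are made up for illustration. On Databricks the same idea would more likely run as a DataFrame join.

```python
from typing import Any

Row = dict[str, Any]


def compare_tables(legacy: list[Row], gold: list[Row], key: str) -> dict[str, Any]:
    """Compare a legacy extract with a gold-layer extract (hypothetical helper)."""
    # Index both sides by the key column (assumed unique on each side).
    legacy_by_key = {row[key]: row for row in legacy}
    gold_by_key = {row[key]: row for row in gold}

    # Keys present on one side only.
    missing_in_gold = sorted(legacy_by_key.keys() - gold_by_key.keys())
    extra_in_gold = sorted(gold_by_key.keys() - legacy_by_key.keys())

    # Row-to-row check: same key on both sides, but different column values.
    mismatched = sorted(
        k
        for k in legacy_by_key.keys() & gold_by_key.keys()
        if legacy_by_key[k] != gold_by_key[k]
    )

    return {
        "legacy_count": len(legacy),
        "gold_count": len(gold),
        "row_count_match": len(legacy) == len(gold),
        "missing_in_gold": missing_in_gold,
        "extra_in_gold": extra_in_gold,
        "mismatched_keys": mismatched,
    }


# Tiny made-up example: row 2 differs, row 3 never arrived in gold.
legacy = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
gold = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
report = compare_tables(legacy, gold, key="id")
```

The returned dict is the kind of thing a test run could dump into its report: counts first, then the keys that need a closer look.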
commercial ops
Microsoft Azure → Databricks → ADF, ADB
ingestion
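
Minimal sketch of the common ingestion framework idea from the key decisions (one entry point, one reader per file format). Everything here is an assumption for illustration: the `READERS` registry, the `register` decorator and `ingest` are hypothetical names, and it uses only the Python standard library. CSV and TXT are covered; an Excel reader would be registered the same way but needs a third-party library (e.g. openpyxl), so it is left out.

```python
import csv
from pathlib import Path
from typing import Callable

Rows = list[dict[str, str]]

# Hypothetical registry: file extension -> reader function.
READERS: dict[str, Callable[[Path], Rows]] = {}


def register(*extensions: str):
    """Decorator that registers a reader for one or more file extensions."""
    def decorator(func: Callable[[Path], Rows]) -> Callable[[Path], Rows]:
        for ext in extensions:
            READERS[ext.lower()] = func
        return func
    return decorator


@register(".csv")
def read_csv(path: Path) -> Rows:
    # The header row supplies the column names.
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


@register(".txt")
def read_txt(path: Path) -> Rows:
    # Assumption: plain text is ingested as one record per line.
    with path.open(encoding="utf-8") as handle:
        return [{"line": line.rstrip("\n")} for line in handle]


def ingest(path: str) -> Rows:
    """Single entry point: pick the reader from the file extension."""
    file_path = Path(path)
    reader = READERS.get(file_path.suffix.lower())
    if reader is None:
        raise ValueError(f"No reader registered for {file_path.suffix!r}")
    return reader(file_path)
```

New formats plug in by adding one decorated function, which is the "identify patterns and reuse them" point from the notes.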
document best practices
- bronze, silver, and gold layers
- raw, curated, and consumption layers
- migration: Hive → Databricks