Example Prompt
Scenario 1:
You are tasked with building a data processing pipeline in Python. The pipeline must ingest data from various sources such as web traffic logs, sales transactions, and social media feeds. Once the data is ingested, it should be cleaned and transformed using Apache Spark. The processed data needs to be stored in Amazon S3. The pipeline should also include error handling with logs sent to CloudWatch and integrate with a machine learning model to provide real-time predictions on incoming data. Finally, the pipeline should be scalable and deployable within a CI/CD pipeline.
Your Task:
Write and try out a prompt (on any GenAI tool, such as ChatGPT or Claude) to generate the code for this data processing pipeline. Your prompt should include the following details:
Technology Stack:
Mention that the pipeline should be built in Python, using Apache Spark for data transformation, Amazon S3 for storage, and CloudWatch for logging.
Data Sources:
Specify the different data sources (web traffic logs, sales transactions, and social media feeds) that the pipeline should handle.
Functionality:
Include the need for data cleaning and transformation, error handling, and integration with a machine learning model for real-time predictions.
Scalability and Deployment:
Emphasise that the pipeline must be scalable and ready for deployment in a CI/CD pipeline.
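One way to make sure none of the four details above is forgotten is to assemble the prompt from named sections and check the result. The following is a minimal sketch, not a prescribed method: `SECTIONS`, `build_prompt`, and `missing_sections` are hypothetical names introduced here for illustration, and the section text is paraphrased from the task description.

```python
from typing import Iterable, List, Mapping

# Hypothetical section map: headings mirror the four details listed above.
SECTIONS = {
    "Technology Stack": (
        "Build the pipeline in Python, using Apache Spark for data "
        "transformation, Amazon S3 for storage, and CloudWatch for logging."
    ),
    "Data Sources": (
        "Handle web traffic logs, sales transactions, and social media feeds."
    ),
    "Functionality": (
        "Include data cleaning and transformation, error handling, and "
        "integration with a machine learning model for real-time predictions."
    ),
    "Scalability and Deployment": (
        "The pipeline must be scalable and ready for deployment in a "
        "CI/CD pipeline."
    ),
}


def build_prompt(role: str, sections: Mapping[str, str]) -> str:
    """Join a role statement and the labelled sections into one prompt."""
    lines = [role, ""]
    for heading, detail in sections.items():
        lines.append(f"{heading}:")
        lines.append(detail)
    return "\n".join(lines)


def missing_sections(prompt: str, required: Iterable[str]) -> List[str]:
    """Return the required headings that do not appear in the prompt."""
    return [heading for heading in required if f"{heading}:" not in prompt]


prompt = build_prompt("You are an experienced data engineer.", SECTIONS)
print(prompt)
```

Running `missing_sections(prompt, SECTIONS)` before pasting the prompt into a GenAI tool gives a quick check that every required detail is present.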
You are an experienced data engineer with expertise in building scalable data processing pipelines using Python. You have knowledge of integrating various technologies for data ingestion, processing, storage, and real-time predictions.
Task: Generate Python code for a data processing pipeline that meets the specified requirements.
Requirements:
- The pipeline must be built in Python.
- Apache Spark is used for data transformation.
- Amazon S3 will be used for storing processed data.
- CloudWatch is required for error logging.
- The pipeline needs to handle data from web traffic logs, sales transactions, and social media feeds.
- It should include data cleaning and transformation, error handling, and integration with a machine learning model for real-time predictions.
- The pipeline must be scalable and deployable within a CI/CD pipeline.
Rules:
- Write modular code without code smells.
- Add simple comments for understanding.
- Include file names and the folder structure.
- Finally, give step-by-step instructions to run the pipeline.
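As a point of comparison for whatever the GenAI tool returns, a modular cleaning step with error handling could look roughly like the sketch below. It is plain Python standing in for the Spark transformation so that it runs without a cluster. The field names in `REQUIRED_FIELDS`, the logger name, the function names, and the sample records are all assumptions made for illustration; none of them come from the scenario.

```python
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("pipeline")  # hypothetical logger name

# Assumed schema for illustration only; the scenario does not define one.
REQUIRED_FIELDS = ("source", "timestamp", "payload")


def clean_record(record: dict) -> dict:
    """Validate one raw record and normalise its fields."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "source": str(record["source"]).strip().lower(),
        "timestamp": str(record["timestamp"]).strip(),
        "payload": record["payload"],
    }


def clean_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield cleaned records; log and skip the ones that fail validation."""
    for record in records:
        try:
            yield clean_record(record)
        except ValueError as exc:
            # A deployed pipeline would route this log line to CloudWatch.
            logger.error("dropping record: %s", exc)


# Made-up sample input: one valid record, one with an empty timestamp.
raw = [
    {"source": " Web ", "timestamp": "2024-01-01T00:00:00Z", "payload": {"path": "/"}},
    {"source": "sales", "timestamp": "", "payload": {"amount": 10}},
]
cleaned = list(clean_records(raw))
print(cleaned)
```

Keeping validation (`clean_record`) separate from iteration and logging (`clean_records`) is the kind of modularity the rules above ask for: each function can be tested on its own, and the logging destination can change without touching the cleaning logic.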