Site Reliability Engineer (SRE)
Job Description:
We are looking for an experienced and highly skilled Site Reliability Engineer (SRE) to join our dynamic team. The ideal candidate will have expertise in managing cloud infrastructure, particularly on Google Cloud Platform (GCP), and extensive experience with containerization technologies such as Docker and Kubernetes. This role demands strong automation and scripting skills, along with experience in infrastructure as code (IaC), monitoring, troubleshooting, and Helm for Kubernetes package management.
Key Responsibilities:
- Google Cloud Platform Management: Manage, monitor, and optimize cloud infrastructure on GCP, ensuring high availability, performance, and scalability of services.
- Containerization with Docker and Kubernetes: Build, deploy, and manage containerized applications in Kubernetes (GKE) clusters. Optimize the container lifecycle and orchestration.
- Helm for Kubernetes: Use Helm to manage Kubernetes applications, simplifying their deployment and versioning across environments.
- Scripting & Automation: Automate cloud infrastructure tasks using scripting languages such as Python, Bash, or similar, streamlining deployment and operational processes.
- Infrastructure as Code (IaC): Use Terraform for automating the provisioning and management of cloud resources. Maintain clean, scalable, and reusable infrastructure code.
- Monitoring & Logging: Implement and manage monitoring and logging tools such as GCP Monitoring, Nagios, Prometheus, and Grafana to ensure visibility into system performance and health.
- Troubleshooting & Incident Management: Troubleshoot applications and services deployed in GKE. Proactively resolve issues with infrastructure and services to minimize downtime.
- Database & Networking: Manage and troubleshoot Cloud SQL databases and handle network configuration and management to ensure efficient communication between services.
- CI/CD Pipeline Management: Manage and enhance CI/CD pipelines using GitHub Actions. Ensure smooth code deployment processes and enable continuous integration and delivery.
- Collaboration & Problem-Solving: Work closely with development teams to identify and resolve application issues. Use strong problem-solving skills to debug and mitigate issues in production.
Qualifications:
- Proven experience with Google Cloud Platform (GCP) and its services.
- Expertise in containerization using Docker and orchestration with Kubernetes (GKE).
- Proficient in using Helm for managing Kubernetes applications.
- Strong experience in scripting and automation using Python, Bash, or similar languages.
- Hands-on experience with Terraform for infrastructure automation.
- Experience with monitoring and logging tools such as GCP Monitoring, Nagios, Prometheus, and Grafana.
- Solid understanding of database management (SQL/NoSQL) and networking fundamentals.
- Experience with CI/CD pipelines, particularly using GitHub Actions, and strong knowledge of Git.
- Excellent troubleshooting and problem-solving skills, particularly in diagnosing issues with applications deployed in GKE.
Eligibility criteria:
- Minimum 5 years of work experience in the same field
Location:
- Work from home
Schedule:
- Monday to Friday
Shift timing:
- Rotational shift