Black Nova VC Opportunities

Discover the opportunities with our startups that are championing the next generation of B2B technology.

Platform Engineer - Kubernetes - Contractor Role - CST TZ

Portainer

Software Engineering
Argentina
Posted on Thursday, June 27, 2024

We are seeking a highly skilled and experienced Platform Engineer to join our remote team. The ideal candidate will have extensive experience in Kubernetes/Swarm administration, troubleshooting across all components, infrastructure, observability, and platform engineering. The role involves managing large-scale Kubernetes environments, and implementing and maintaining the platform while ensuring its reliability and scalability. You will also take part in an on-call rotation to handle critical incidents.

The role includes (but may not be limited to) the following functions:

- Kubernetes Management:

  • Manage and optimise large-scale Kubernetes clusters.
  • Perform version updates, configuration changes, and troubleshoot issues.
  • Assist with and maintain container orchestration using Kubernetes.

- Platform Engineering Services:

  • Maintain and expand the platform solution to meet SLA/OLA requirements.
  • Perform platform moves/adds/changes and monitor core platform metrics.
  • Manage load across components and ensure normal operating parameters.
  • Implement component updates for defect resolution and preventive maintenance.

- Operational Onboarding:

  • Create and maintain documentation for service levels, roles, and responsibilities.
  • Conduct platform reviews and tooling deployments.

- DevOps and SRE:

  • Support the use of GitOps pipelines and assist with application deployment strategies.
  • Provide guidance on namespace, cluster, access control, and isolation best practices.
  • Implement blue/green deployment strategies and assist with performance issues.

- Automation and DR Planning:

  • Develop automations for preventive maintenance and operational efficiency.
  • Create and validate cluster recovery guides to ensure infrastructure recoverability.

- Emergency Support:

  • Provide 24/7 emergency engineering support with a 1-hour response SLA.
  • Analyse alerts and perform root cause analysis to prevent recurrence.