Prometheus Relabelconfigs Example Jobs in USA
Job Description
The purpose of this role is to develop job plans for Maintenance Work Orders, so the work can be scheduled and executed in a safe, efficient, and effective manner per the steps listed in the job plan.
What will be expected from you?
- Reviews work orders for completeness / all required information (based on a good quality work order) and returns work orders to Operations Maintenance Coordinator if not complete. Escalates issues to the Maintenance Work Process Lead for chronic issues.
- Uses archived plans (Task List) where possible for the basis of creating new plans.
- Where provided, uses guides for Value Added Job Plans and guidance for estimating labor requirements. Goal is to eliminate non-value-added planning activities and ensure that job plans are not fat.
- Uses “quality feedback” from the person who completed the work to optimize archived plans. Seeks approval from the Planning Lead before changing numbers of crafts and job duration.
- Provides feedback to the person who supplied the Planner Feedback Form on how the information was used (updates, communication, etc.). This includes decisions not to make changes.
- Carries out field check when needed to verify job plan.
- Prepares job plan and inputs plan into SAP indicating the required resources, materials, and services.
- Prepares the job package for the Work Coordinator and Crafts to use to ensure a safe, quality, productive job execution.
- Works with the Material Coordinator after material availability has been confirmed to update the Material code in SAP to RSCD identifying job is ready to schedule.
- When material will NOT meet the requested Basic Finish Date (supplied by Material Coordinator), works with the Operations Maintenance Coordinator to identify a new date when material will be available. When applicable, works with Operations Maintenance Coordinator for approval to expedite material to meet a faster date.
- Identifies task activities necessary to execute a job plan safely, efficiently, and effectively.
- Identifies resources, materials, tools, and equipment required to perform tasks.
- Identifies / solicits safety information and requirements necessary to safely perform tasks.
- Records tasks / information within the work order.
- Notifies other crafts of their requirement to provide input to the job plan.
- Updates and archives repetitive job plans (updates Task List). Validates good data before saving.
- Considers constructability and maintainability when developing plans.
- Considers material costs when selecting materials from specifications.
- Utilizes “Best Practices” where applicable during the planning process.
- Uses Subject Matter Experts (SME’s) when necessary while building proactive repair job plans to support Reliability.
- Works collaboratively as a team with Operations Maintenance Coordinator, Scheduler, Material Coordinator, Maintenance Execution, and Maintenance Engineering.
- Maintains the backlog in assigned Planner Group Code to the agreed-on backlog, escalates when backlog is over the limit, helps others with their backlog when available.
- Reviews backlog daily for key actions.
- Identify long-lead delivery items and inform Material Coordinator.
- Identify “Grab and Go/Fill in Work” jobs (jobs that can be done with no planning, and no or minimal disruption to operations).
- Attends Planner-required meetings including, but not limited to, planning meeting, schedule review meeting, backlog meeting, monthly safety meeting, and quarterly town-hall meeting.
- Maintains metrics to support role improvement.
- Ensure that the activities are carried out to comply with the integrated management system, as applicable (safety, food safety, GMP, health, environmental, quality and responsible care requirements).
- Essential functions require presence in the workplace on a regular basis and an ability to work extra hours if needed. If applicable, ability to work overtime may be needed to ensure required staffing capacity to meet daily production objectives.
- Prepare annual and bi-annual maintenance plans for all major plant Instrument/Electrical or Mechanical equipment site wide including opportunistic shutdowns. Plans should include detailed schedules of all tasks to be performed, the necessary resources allocated both internal and external, the preparation of all work descriptions for outsourced activities and the identification and ordering of critical spares and/or special tools or equipment needed for the planned repairs.
- Scope bid and coordinate with procurement to award Instrument/Electrical or Mechanical/soft craft contractor work that is both safe and effective.
- Directly accountable and responsible for those aspects of the Instrument/Electrical or Mechanical/soft craft work orders planning backlog and quality.
- Review work order for accuracy and completeness before it is assigned to ready to schedule (RTS).
- Carry out other functions as assigned by the Planning and Scheduling Manager
- Work collaboratively with team members with different backgrounds and perspectives.
- Assists other employees in accomplishments of Indorama company goals. Follows instructions and performs other duties as may be assigned by supervisor.
- Participates in Environmental, Health, & Safety initiatives as set forth by the company. Participates in and completes company required training programs.
What are we looking for in the Ideal candidate?
- High School Diploma or Equivalent AND a minimum of 5 years of experience as a maintenance technician/specialist and at least 5 years of Preventive Maintenance Planning in a manufacturing environment OR
- Associate degree AND a minimum of 3 years of experience as a maintenance technician/specialist and at least 2 years of Preventive Maintenance Planning in a manufacturing environment.
Required
- Instrument/Electrical or Mechanical Maintenance background is required.
- Demonstrated ability to operate computers and Windows software such as Word, Outlook, and Excel.
- Demonstrated ability to independently plan, organize, prioritize, and estimate.
- Demonstrated effective verbal and written communication skills.
- Ability to sequence job activities, acquire documentation, procure material, use specification references, prepare sketches, and maintain records.
- Ability to interpret blueprints, bill of materials, and detailed drawings.
Preferred
- Proficiency in SAP Materials Management with Prometheus a plus
- Proficiency in Maintenance Planning optimization
- Proficiency in SAP document Management System (DMS) for work order planning optimization
- Ability to learn new skills and work methods, acquire process knowledge, research maintenance technologies, and apply improvements in the contract maintenance management system.
- Ability to perform effectively in maintenance training activities in the electrical, instrumentation and related fields.
LTIMindtree is an equal opportunity employer that is committed to diversity in the workplace. Our employment decisions are made without regard to race, color, creed, religion, sex (including pregnancy, childbirth or related medical conditions), gender identity or expression, national origin, ancestry, age, family-care status, veteran status, marital status, civil union status, domestic partnership status, military service, handicap or disability or history of handicap or disability, genetic information, atypical hereditary cellular or blood trait, union affiliation, affectional or sexual orientation or preference, or any other characteristic protected by applicable federal, state, or local law, except where such considerations are bona fide occupational qualifications permitted by law.
A little about us...
Role: AWS DevOps Engineer
Location: Charlotte, NC
Salary: Market Rate
Job Description:
We are seeking a highly skilled Senior DevOps Engineer with strong expertise in AWS cloud infrastructure, automation, databases, and modern containerized environments. The ideal candidate will have experience designing, implementing, and maintaining scalable, secure, and reliable systems while enabling fast and efficient development workflows. You will work closely with development, architecture, and operations teams to build robust CI/CD pipelines, automate infrastructure provisioning, and ensure high availability of business-critical applications.
Key Responsibilities:
- Design, implement, and manage AWS cloud infrastructure (EC2, S3, Lambda, ECS/EKS, etc.) with scalability and security in mind
- Develop and maintain Infrastructure as Code (IaC) using Terraform
- Build, manage, and optimize Docker base images and containerized application stacks
- Orchestrate and maintain Kubernetes (EKS) clusters for production and staging environments
- Set up, manage, and optimize CI/CD pipelines in GitLab to support fast, reliable deployments
- Manage MCP servers and ensure reliable operations for critical services
- Automate operational tasks and workflows using Python and JavaScript
- Support full-stack teams (React, Node.js) by providing containerized environments and deployment strategies
- Manage and optimize databases (SQL, PostgreSQL) for performance, security, and scalability
- Integrate and manage AWS streaming services (Kinesis, MSK, Kafka, or similar) for real-time data pipelines
- Implement container image security scanning, governance, and lifecycle management
- Monitor system performance, availability, and cost, implementing proactive improvements
- Ensure compliance with security and governance standards across cloud infrastructure and database layers
- Collaborate with developers and architects to improve application delivery, scalability, and resilience
Required Skills & Qualifications:
- 8 years of experience in DevOps / Cloud Infrastructure
- Strong hands-on experience with AWS services (EC2, S3, ECS/EKS, Lambda, VPC, IAM, CloudWatch, Kinesis, MSK)
- Proficiency in Terraform for infrastructure automation
- Expertise with Docker (including base image creation) and Kubernetes orchestration
- Strong scripting/programming skills in Python and JavaScript
- Experience with GitLab CI/CD for pipelines, automation, and environment management
- Strong database experience with SQL and PostgreSQL (setup, scaling, replication, performance tuning)
- Exposure to streaming architectures (AWS Kinesis, Kafka, MSK, or similar)
- Experience supporting React-based applications from a DevOps perspective
- Familiarity with MCP servers and containerized service deployments
- Knowledge of cloud cost optimization and security best practices
- Strong problem-solving, troubleshooting, and communication skills
Preferred Qualifications:
- AWS certifications (e.g., AWS Certified Solutions Architect, DevOps Engineer Professional)
- Experience with monitoring/observability tools (Prometheus, Grafana, ELK, Datadog)
- Knowledge of networking, load balancing, and distributed system design
- Familiarity with Agile/Scrum methodologies
Skills
- Mandatory Skills : AWS Lambda, Docker, Python
- Good to Have Skills : Ansible, Git, Kubernetes
A little about us...
Role: Azure DevOps Engineer
Location: Berkeley Heights, NJ
Job Description:
1. Extensive hands-on experience with GitHub Actions, writing workflows in YAML using reusable templates
2. Extensive hands-on experience with application CI/CD pipelines, both for Azure and on-prem, for different frameworks
3. Hands-on experience with Azure DevOps and migration programs for CI/CD pipelines, preferably from Azure DevOps to GitHub Actions
4. Proficiency in integrating and consuming REST APIs to achieve automation through scripting
5. Hands-on experience with at least one scripting language, including out-of-the-box automations for platforms such as PeopleSoft, SharePoint, MDM, etc.
6. Hands-on experience with CI/CD of databases
7. Good to have: experience with infrastructure-as-code, including ARM templates, Terraform, Azure CLI, and Azure PowerShell modules
8. Exposure to monitoring tools such as ELK, Prometheus, and Grafana
Hi
I hope you’re doing well.
My name is Sai, and I’m an Account Manager with Astir IT Solutions. We are currently working with our client on a senior-level opportunity for an Agentic AI QA Engineer in Dallas, TX (Need Locals).
Based on your background, I believe this role could be a strong fit.
Job Title: Agentic AI QA Engineer
Location: Dallas, TX (Need Locals)
Experience: 7+ years
Position type: Contract W2/C2C
Required Qualifications
• 7+ years in Software QA/Testing, with 2+ years in AI/ML or LLM-based systems; hands-on experience testing agentic/multi-agent architectures.
• Strong programming skills in Python or TypeScript/JavaScript; experience building test harnesses, simulators, and fixtures.
• Experience with LLM evaluation (exact/soft match, BLEU/ROUGE, BERTScore, semantic similarity via embeddings), guardrails, and prompt testing.
• Expertise in distributed systems testing: latency profiling, resiliency patterns (circuit breakers, retries), chaos engineering, and message queues.
• Familiarity with orchestration frameworks (LangChain, LangGraph, LlamaIndex, DSPy, OpenAI Assistants/Actions, Azure OpenAI orchestration, or similar).
• Proficiency with CI/CD (GitHub Actions/Azure DevOps), observability (OpenTelemetry, Prometheus/Grafana, Datadog), and feature flags/canaries.
• Solid understanding of privacy/security/compliance in AI systems (PII handling, content policies, model safety).
• Excellent communication and leadership skills; proven ability to work cross-functionally with Ops, Data, and Engineering.
Preferred Qualifications
• Experience with multi-agent simulators, agent graph testing, and tooling latency emulation.
• Knowledge of MLOps (model versioning, datasets, evaluation pipelines) and A/B experimentation for LLMs.
• Background in cloud (AWS), serverless, containerization, and event-driven architectures.
• Prior ownership of cost/latency/SLAs for AI workloads in production.
If you are currently open to new opportunities, I would appreciate the chance to connect and discuss this role in more detail. Please let me know a convenient time for a quick call, or feel free to share your updated resume.
Looking forward to hearing from you.
Thanks & Regards.
Sai
Sr. Account Manager
Astir IT Solutions, Inc.
ID: , Contact: 732-694-6000 * 795
We are seeking an experienced Cloud Platform Engineer with deep expertise in Red Hat OpenShift and a strong Linux systems engineering background. This role will be responsible for designing, building, and operating large-scale OpenShift platforms within on-premises datacenter environments.
The ideal candidate will work closely with SRE teams and Program Management to drive the successful implementation, scaling, and operationalization of enterprise-grade OpenShift infrastructure.
Key Responsibilities
1. Platform Engineering
- Design, deploy, and manage enterprise-scale Red Hat OpenShift clusters in on-prem datacenter environments.
- Architect highly available, scalable, and secure OpenShift platforms.
- Implement cluster lifecycle management (installation, upgrades, patching, scaling).
- Configure networking, storage, ingress, and security components for OpenShift.
2. Infrastructure Build & Automation
- Build and automate infrastructure in datacenter environments (compute, storage, networking).
- Integrate OpenShift with virtualization platforms (VMware/other hypervisors as applicable).
- Develop Infrastructure-as-Code (IaC) solutions using tools such as Terraform, Ansible, or similar.
- Implement CI/CD pipelines for platform deployments and updates.
3. Linux Systems Engineering
- Provide deep Linux system administration and troubleshooting support.
- Optimize OS-level configurations for performance, reliability, and security.
- Automate system configuration and compliance management.
- Diagnose and resolve complex kernel, networking, and storage issues.
4. Reliability & Operations
- Partner closely with the SRE team to establish SLOs, SLIs, monitoring, and alerting.
- Drive observability implementation (logging, metrics, tracing).
- Participate in incident management, root cause analysis (RCA), and remediation.
- Ensure platform resiliency, performance tuning, and capacity planning.
5. Program & Cross-Functional Collaboration
- Work with Program Management to drive large-scale OpenShift implementation milestones.
- Provide technical input into roadmap planning, timelines, and risk mitigation.
- Collaborate with security, networking, storage, and application teams.
- Document architecture, standards, and operational procedures.
6. Security & Compliance
- Implement RBAC, security policies, and compliance controls within OpenShift.
- Harden clusters according to enterprise security standards.
- Support vulnerability management and patch governance processes.
Required Qualifications
- 5+ years of experience in Linux systems engineering (RHEL preferred).
- 3+ years of hands-on experience with Red Hat OpenShift (OCP 4.x preferred).
- Proven experience building infrastructure in on-prem datacenter environments.
- Strong understanding of:
  - Kubernetes architecture
  - Networking (DNS, load balancing, firewalls, SDN)
  - Storage (SAN, NAS, CSI drivers)
  - Virtualization platforms (VMware, etc.)
- Experience with automation tools (Terraform, Ansible, GitOps).
- Strong troubleshooting and problem-solving skills.
Preferred Qualifications
- Red Hat certifications (RHCE, OpenShift Certification).
- Experience implementing OpenShift at enterprise scale (multi-cluster environments).
- Experience working in SRE-driven environments.
- Knowledge of DevOps/GitOps practices.
- Experience with monitoring tools (Prometheus, Grafana, ELK, etc.).
Job Title: Windows SRE – Vulnerability Management & PowerShell
Location: Onsite
Experience: 8+ Years
Job Summary:
Looking for a Windows SRE with strong experience in managing enterprise Windows environments, vulnerability remediation, and automation using PowerShell. The role focuses on improving system reliability, security, and operational efficiency.
Main Skills Required:
- Windows Server Administration (2016/2019/2022)
- Vulnerability Management (Qualys / Tenable / Nessus / Rapid7)
- PowerShell Scripting & Automation
- Patch Management (SCCM / WSUS / Intune)
- Active Directory & Group Policy
- SRE / Production Support Experience
- Monitoring Tools (Splunk / Datadog / Prometheus)
- Incident Management & Root Cause Analysis
- Security Hardening & Compliance (CIS / NIST)
- Cloud Exposure (Azure / AWS)
- Infrastructure Automation (Ansible / Terraform)
Job Title: Rotating Equipment Planner
Location: Baytown TX
Duration: indefinite
Rate: $50-$60 per hour DOE
Description:
Position Summary
The Rotating Equipment Planner specializes in planning, scheduling, and coordinating maintenance activities for critical rotating equipment (pumps, compressors, turbines, motors, gearboxes, cooling towers, etc.). This role prepares detailed plans for non-emergency maintenance work selected through the Risk Based Work Selection (RBWS) process, ensuring optimal equipment reliability and performance while minimizing production downtime.
Key Responsibilities
• Planning: Develop detailed work plans for rotating equipment maintenance, including precision alignments, vibration analysis, and bearing replacements with appropriate man-hour and cost estimates
• Technical Expertise: Apply specialized knowledge of rotating equipment mechanics, tolerances, and failure modes to develop effective maintenance strategies and troubleshooting procedures
• Materials Management: Ensure critical rotating equipment spare parts (bearings, seals, couplings) are properly inventoried and available; create and maintain Bills of Material
• Work Coordination: Coordinate with Contractor Management Coordinator for resource requirements; prioritize maintenance activities between crews and production teams to minimize process disruption
• Documentation & Systems: Create and maintain task lists for repetitive jobs; outline detailed work instructions with safety advice, resources, and tools; close out jobs by entering notification history
• Reliability Improvement: Collaborate with production and technical teams to establish preventive/predictive maintenance plans, including vibration monitoring programs and lubrication schedules
• Backlog Management: Review and purge backlog weekly, distributing 'ready-to-schedule' work; identify and communicate repetitive equipment problems to Asset Engineer
Required Qualifications
• High school diploma or equivalent
• 12 years of heavy industrial maintenance experience OR 7 years with an associate's degree OR 4 years with a bachelor's degree
• Certification from Vocational or Technical school in millwright or verifiable millwright experience
• Demonstrated experience in equipment planning for rotating equipment and cooling towers
• Minimum 2 years planning/scheduling experience
• In-depth knowledge of SAP-PM Maintenance Transactions and Prometheus
• Experience using Microsoft Office Products (Word, Excel, Outlook etc.)
• The eligibility to apply for and obtain a Transportation Worker Identification Credential (TWIC) within a reasonable timeframe
Physical Requirements
• Ability to climb stairs and work at heights up to 100+ feet
• Ability to climb vertical ladders
• Sufficient physical strength to perform requirements safely
• Ability to work at computer workstation for extended periods
Success Metrics
• Performance measured by quality of planning and meeting established KPIs
Build and scale enterprise Kafka infrastructure using Confluent Cloud and Platform across hybrid environments. Design event-driven architectures, automate deployments with Terraform/CI/CD, optimize performance, ensure security compliance, and troubleshoot distributed streaming systems at scale.
Must Have:
- 5+ years Kafka (2+ years Confluent Cloud/Platform, Kafka Connect, Schema Registry, ksqlDB)
- Expertise in hybrid cloud Kafka deployments (AWS/Azure/GCP + on-prem)
- Strong automation (Terraform, Ansible, Jenkins) and programming (Java, Python, Scala)
- Experience with monitoring/troubleshooting distributed systems (Splunk, Datadog, Prometheus)
- Security expertise (Kerberos, SSL, RBAC) and compliance knowledge (GDPR, SOC, PCI)
You'll Build: Scalable Kafka clusters • Event-driven architectures • Automated CI/CD pipelines • Observability frameworks • Secure, compliant streaming platforms.
The Aspen Group (TAG) is one of the largest and most trusted retail healthcare business support organizations in the U.S. and has supported over 20,000 healthcare professionals and team members with close to 1,500 health and wellness offices across 48 states in four distinct categories: dental care, urgent care, medical aesthetics, and animal health. Working in partnership with independent practice owners and clinicians, the team is united by a single purpose: to prove that healthcare can be better and smarter for everyone. TAG provides a comprehensive suite of centralized business support services that power the impact of five consumer-facing businesses: Aspen Dental, ClearChoice Dental Implant Centers, WellNow Urgent Care, Chapter Aesthetic Studio, and Lovet Pet Health Care. Each brand has access to a deep community of experts, tools and resources to grow their practices, and an unwavering commitment to delivering high-quality consumer healthcare experiences at scale.
As a Senior Site Reliability Engineer (SRE) at TAG – The Aspen Group, you will be responsible for ensuring the reliability, performance, and scalability of our core systems. This role involves proactively building and managing monitoring solutions, leading incident response, and continuously optimizing system performance to exceed business objectives. We are actively integrating AI and machine learning into our operational workflows, and you will be on the front lines, leveraging intelligent automation and machine learning to build a proactive, resilient infrastructure. This is an opportunity to go beyond SRE by applying cutting-edge technology to solve complex reliability challenges.
Responsibilities:
Intelligent Site Reliability Engineering:
- Design and build highly scalable and resilient systems to support our applications and services, incorporating predictive analytics to anticipate reliability risks.
- Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using machine learning anomaly detection to ensure systems meet reliability targets.
- Drive improvements in system reliability, availability, and performance through proactive measures, automation, and intelligent failure prediction.
Advanced Observability:
- Implement and manage comprehensive monitoring and alerting solutions, integrating with intelligent observability platforms that reduce alert noise and correlate events.
- Develop and maintain dashboards and reporting tools that provide data-driven insights for actionable troubleshooting recommendations and performance optimization.
- Evaluate and integrate advanced monitoring tools and operational intelligence platforms to enhance observability and root cause identification.
Proactive Incident Management:
- Lead and participate in incident response efforts, using intelligent log analysis and automated event correlation to speed up troubleshooting and root cause identification.
- Develop and maintain incident management processes incorporating automated decision support systems to improve response times and minimize service disruptions.
- Conduct post-incident reviews, using automated pattern recognition and trend analysis to identify systemic issues and implement preventive measures.
Performance and Capacity Optimization:
- Analyze performance metrics and logs, supported by advanced observability tools, to detect bottlenecks and inefficiencies.
- Collaborate with development teams to implement automated profiling and optimization recommendations for code and infrastructure improvements.
- Perform capacity planning using machine learning forecasting models to ensure systems can handle current and future loads.
Automation and Process Improvement:
- Develop and implement automation solutions, including intelligent runbook automation, self-healing systems, and automated incident triage.
- Identify and drive process improvements by applying machine learning to operational data for continuous optimization.
- Maintain documentation that includes automation and machine learning guidelines for monitoring, incident management, and SRE best practices.
Collaboration and Communication:
- Work closely with engineering, operations, and product teams to align reliability and monitoring goals, including automation adoption strategies.
- Communicate effectively with stakeholders, providing regular updates on system health, incidents, performance improvements, and data-driven insights.
- Foster a culture of collaboration, knowledge sharing, and automation best practices within the team and across the organization.
Requirements:
- Bachelor's degree in computer science or a related technical field.
- At least 5 years of experience in Site Reliability Engineering or a similar role.
- Strong proficiency in at least one programming language such as Python, Go, or C#.
- Demonstrated experience applying machine learning and automation to operational workflows such as monitoring, alerting and incident response.
- Expertise with infrastructure as code tools such as Terraform.
- Proven experience working with and monitoring container environments such as Cloud Run and Kubernetes.
- Hands-on experience working within Azure, AWS, and GCP environments (GCP preferred).
- Strong understanding of networking, distributed systems, and cloud infrastructure.
- Familiarity with intelligent monitoring platforms and operational analytics tools such as Prometheus, Grafana, OpenSearch, Sentry, and Google Cloud Observability.
- Excellent problem-solving skills and the ability to work independently and as part of a team.
- Experience with incident management, root cause analysis, and automated operational workflows.
Annual pay range: $129,000-$160,000
A generous benefits package that includes paid time off, health, dental, vision, and 401(k) savings plan with match
Our Ideal Candidate
We are seeking an experienced cloud and DevOps engineer with over 5 years of experience designing, automating, and maintaining scalable AWS infrastructure, CI/CD pipelines, and secure cloud environments. In the role of Senior Cloud Platform Engineer, you should demonstrate expertise in Infrastructure as Code, scripting, containerization, and modern monitoring or alerting platforms, as well as strong skills working across teams. Success in this position requires a talent for optimizing cloud resources, ensuring security and compliance, and facilitating fast, reliable software deployments. Having experience with HIPAA-compliant systems, .NET platforms, or serverless computing is considered a significant advantage.
Responsibilities
- Design, implement, and maintain CI/CD pipelines using tools like AWS CDK, AWS CodePipeline, or GitHub Actions.
- Manage infrastructure as code (IaC) using Terraform, CloudFormation, or similar tools.
- Monitor system performance and availability using tools like CloudWatch, Prometheus, Grafana, or Datadog.
- Automate repetitive tasks and deployment processes to improve team efficiency.
- Collaborate with software engineers, QA, and product teams to ensure smooth deployments and rapid iteration.
- Implement and enforce security best practices and compliance across infrastructure and deployment pipelines.
- Identify optimizations to reduce cloud resource usage across AWS accounts.
- Maintain documentation for infrastructure, processes, and compliance requirements.
- Work with multiple teams to implement their deployments using common practices.
- Manage builds and the corresponding documentation.
- Monitor package versions, track EOL dates, and upgrade to keep infrastructure current.
Qualifications
- B.S. Computer Science degree or equivalent experience.
- 5+ years of experience in DevOps, Site Reliability Engineering, or related roles.
- 2+ years of hands-on AWS experience.
- Strong experience with cloud platforms (AWS, Azure, or GCP).
- Proficiency in scripting languages such as Bash, Python, or PowerShell.
- Experience with containerization and orchestration (Docker, Kubernetes).
- Familiarity with monitoring, logging, and alerting tools.
- Solid understanding of networking, security, and system administration.
- Strong communication skills and ability to work cross-functionally.