Prometheus Relabelconfigs Drop Example Jobs

Director Information Technology Architecture

Salary not disclosed

Boston, MA 6 days ago

We're Hiring: Director of IT Architecture (Remote, with onsite meetings as needed)

We are seeking a Director of IT Architecture (Enterprise Architecture, Cloud & Systems Leader) to lead and shape the IT architecture strategy for a growing healthcare organization. This is a unique opportunity to design and implement technology solutions that support business goals, regulatory compliance, and modern healthcare delivery.

Key Responsibilities:

Define and execute a comprehensive IT architecture strategy aligned with clinical, operational, and business objectives.
Lead and manage a team of network, cloud, and systems architects, fostering collaboration and high performance.
Oversee network, cloud, and systems architecture initiatives, ensuring security, scalability, and interoperability.
Evaluate, test, and implement modern platform visibility solutions (Datadog, Dynatrace, New Relic, Prometheus / Grafana).
Collaborate with IT leadership, business stakeholders, vendors, and cloud providers to optimize technology investments.
Establish IT governance, standards, and best practices to ensure compliance with industry regulations (HIPAA, HITECH, HITRUST).
Monitor performance, risks, and cost optimization across all IT architecture initiatives.

Required Qualifications:

Bachelor’s degree in Computer Science, IT, Healthcare Informatics, or related field.
10+ years of progressive experience in IT architecture, including at least 5 years in a leadership role managing network, cloud, and systems architecture teams, preferably in healthcare.
Hands-on experience with cloud platforms (AWS, Azure) and hybrid environments.
Demonstrated history of assessing, testing, and implementing modern platform visibility solutions (Datadog, Dynatrace, New Relic, Prometheus / Grafana).
Strong expertise in network architecture (SD-WAN, VPNs, firewalls, healthcare data exchange networks).
Deep knowledge of systems architecture, including server infrastructure, virtualization, storage, disaster recovery, and healthcare IT standards (HL7, FHIR, DICOM).
Strong leadership, communication, and stakeholder management skills.
Strategic thinker with strong problem-solving and analytical abilities.

Preferred Qualifications:

Master’s degree in a related field.
Relevant cloud certifications (AWS Solutions Architect Professional, AWS Security Specialty, Microsoft Azure Solutions Architect).
Security and architecture certifications such as CISSP, CCNP, HITRUST, or FinOps.

This is a remote role with the flexibility to work from home, while requiring occasional onsite meetings for leadership collaboration and strategic planning.

If you are a visionary IT leader with a strong healthcare background, experience leading cloud, network, and systems architecture teams, and a passion for building scalable, secure IT platforms, we want to hear from you!

Not Specified

Lead Enterprise Tooling Engineer

🏢 Tenant Inc.

Salary not disclosed

Irvine, CA 6 days ago

Lead Enterprise Tooling Engineer — Tenant Inc.

Overview

Tenant Inc. is modernizing its enterprise tooling, automation, and visibility ecosystem to better support our engineering, operations, finance, sales, and customer support teams. The Lead Enterprise Tooling Engineer plays a critical role in this transformation by owning the strategy, architecture, and execution of integrations across Jira, Microsoft 365, HubSpot, Zendesk, Intuit Enterprise, ERP systems, and internal platforms. This role ensures that our business systems work together seamlessly, data flows reliably across the organization, and leaders have a unified view of operational performance.

By connecting enterprise tools with application telemetry and APM insights, this position enables a single source of truth for workflow health, customer impact, and cross-system reliability. The ideal candidate blends technical expertise with business acumen, ensuring that tooling investments directly support Tenant’s operational goals and modernization roadmap.

Key Responsibilities

Enterprise Tooling Architecture & Integration

• Design and maintain the integrations that connect our core business systems, ensuring information flows consistently across Jira, Microsoft 365, HubSpot, Zendesk, Intuit Enterprise, ERP platforms, and internal applications.

• Build automated workflows and API-driven processes that reduce manual effort, eliminate redundant work, and improve data accuracy.

• Lead the unification of identity, permissions, and user lifecycle management across enterprise tools to support operational efficiency and compliance.

• Oversee cross-platform data synchronization for contacts, leases, tickets, financial data, and operational workflows to ensure a consistent and reliable customer and business experience.

APM, Observability & Unified Visibility

• Integrate observability and APM platforms (OpenSearch, Prometheus, Grafana, New Relic, Catchpoint, CloudWatch, clickstream analytics) with enterprise systems to provide end-to-end visibility across the business.

• Connect system telemetry with business workflows—linking application performance to Jira issues, Zendesk tickets, HubSpot activities, and ERP events.

• Develop executive-ready dashboards that consolidate operational KPIs, workflow performance, integration health, and customer impact into a single pane of glass.

• Implement alerting and automated correlation to help teams identify issues faster and understand their business implications.

• Partner with DevOps and SRE to ensure observability data is actionable and accessible across the organization.

Workflow Automation & Process Optimization

• Design automated workflows that streamline processes across engineering, support, sales, finance, and operations.

• Build Jira workflows, dashboards, and governance structures that support predictable releases and cross-team alignment.

• Automate HubSpot → Jira → Zendesk → ERP workflows to reduce handoffs, shorten cycle times, and improve customer responsiveness.

• Partner with Finance to automate Intuit Enterprise and ERP processes such as invoicing, reconciliation, and reporting.

API Engineering & Custom Development

• Develop and maintain custom integrations, middleware, and internal tools that improve operational efficiency and reduce manual work.

• Implement reliable error handling, monitoring, and logging to ensure integrations remain stable and transparent.

• Ensure all integrations meet security, scalability, and compliance requirements.

Data Quality, Governance & Observability

• Establish data governance standards that ensure accuracy, consistency, and auditability across enterprise tools.

• Implement monitoring and alerting for integration health and workflow performance.

• Partner with Security and Compliance to maintain SOC2, PCI, and internal governance standards.

Cross-Functional Leadership & Collaboration

• Serve as the strategic and technical leader for enterprise tooling, automation, and observability initiatives.

• Partner with Engineering, Product, Support, Sales, Finance, and Operations to understand business needs and translate them into scalable solutions.

• Mentor engineers and administrators across Jira, HubSpot, Zendesk, and Microsoft 365.

• Promote best practices for automation, documentation, and cross-system reliability.

Operational Excellence

• Lead root cause analysis for integration and workflow issues, ensuring long-term solutions rather than short-term fixes.

• Reduce manual effort across departments through automation and improved tooling.

• Maintain clear documentation for integrations, workflows, and system dependencies.

• Evaluate new tools, vendors, and opportunities to improve operational efficiency and business outcomes.

Required Qualifications

• 7+ years in enterprise tooling, business systems engineering, DevOps, or integration engineering.

• Deep experience with APIs for Jira, Microsoft 365, PowerBI, HubSpot, Zendesk, and similar SaaS platforms.

• Hands-on experience with observability and APM platforms (OpenSearch, Prometheus, Grafana, New Relic, Catchpoint, CloudWatch, clickstream analytics).

• Strong scripting and automation skills (Python, Node.js, PowerShell).

• Experience designing workflow automation across multiple business systems.

• Strong understanding of identity management, SSO, and permission models.

• Experience with data governance, monitoring, and integration reliability.

• Proven ability to lead cross-functional initiatives and collaborate with business stakeholders.

Preferred Qualifications

• Experience with Intuit Enterprise, ERP systems, or financial system integrations.

• Background in multi-tenant SaaS environments.

• Experience improving customer experience through event-driven architectures (webhooks, queues, EventBridge, SNS/SQS).

• Familiarity with ETL pipelines, data warehousing, and analytics platforms.

• Experience supporting engineering release workflows and IT DevOps processes.

Success Indicators at Tenant Inc.

• A unified, executive-ready view of operational performance that connects APM telemetry, enterprise workflows, and business outcomes.

• Automated, reliable workflows across Jira, HubSpot, Zendesk, Microsoft 365, and ERP systems.

• Significant reduction in manual work across engineering, support, sales, and finance.

• Clean, consistent, and governed data across enterprise tools.

• Reliable integrations with clear dashboards, alerting, and business impact visibility.

• Strong cross-team alignment and measurable improvements in operational efficiency.

• A scalable, well-documented tooling architecture that supports Tenant’s modernization strategy.

#EnterpriseEngineering #BusinessSystems #ToolingEngineering #AutomationEngineering

#SystemsIntegration #APM #Observability

Not Specified

Site Reliability Engineer

✦ New

🏢 Galaxy i technologies Inc

Salary not disclosed

Austin, TX 1 day ago

Job Title: Site Reliability Engineer (SRE) – DataHub & GraphQL

Location: Austin, TX & Sunnyvale, CA

Looking for independent visa holders only.

Role Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in DataHub ingestion pipelines and GraphQL APIs. The ideal candidate will be responsible for designing, building, and maintaining scalable data ingestion frameworks, ensuring reliability and performance of enterprise data platforms, and enabling seamless integration with downstream applications. This role requires a balance of software engineering, systems reliability, and data platform knowledge.

Key Responsibilities

Design, implement, and optimize DataHub ingestion pipelines for large-scale enterprise data systems.
Develop and maintain GraphQL APIs to support data discovery, metadata management, and integration.
Ensure high availability, scalability, and performance of data services across cloud and on-prem environments.
Collaborate with data engineering, product, and infrastructure teams to deliver reliable data solutions.
Automate monitoring, alerting, and incident response processes to improve system resilience.
Drive best practices in observability, logging, and distributed system reliability.
Troubleshoot complex production issues and implement long-term fixes.

Must-Have Skills

5+ years of experience as an SRE, DevOps Engineer, or Software Engineer with a focus on reliability and scalability.
Strong hands-on experience with DataHub ingestion frameworks and metadata pipelines.
Proficiency in GraphQL API design and implementation.
Solid understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes, Docker).
Expertise in monitoring tools (Prometheus, Grafana, ELK, Datadog, etc.).
Strong programming skills in Python, Java, or Go.
Experience with CI/CD pipelines and infrastructure-as-code (Terraform, Ansible).

Good-to-Have Skills

Familiarity with data governance and metadata management tools.
Experience integrating with data platforms like Kafka, Spark, or Snowflake.
Knowledge of REST APIs and microservices architecture.
Exposure to security and compliance practices in data systems.

Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
Proven track record of delivering reliable, scalable data infrastructure solutions.

Not Specified

Cassandra Database Engineer/Administrator

✦ New

🏢 Mindlance

Salary not disclosed

Beaverton, OR 1 hour ago

Title: Database Engineer

Duration: 11-12 months

Location: Austin, TX (78759). Hybrid role: in office Mon, Wed, and Thurs is a must (no flexibility on these days).

Job Description:

The Cassandra Database Engineer is an expert across NoSQL database technologies and specifically a specialist in Cassandra database administration.

For this position, NoSQL database expertise is mandatory, with a primary focus on Cassandra databases, as well as expertise in public cloud technology (AWS and/or GCP).

For this mission, the engineer will primarily be responsible for database operational activities.

Essential Functions / Key Areas of Responsibility

The Database Engineer's primary responsibilities:

· Database performance analysis and operations review for production database platforms

· Manage database operations activities including incident response, database alert resolution, and managing third party support engagement

· Deploy and maintain database monitoring solutions.

· Test and build database restore and recovery procedures

· Database platform deployment, installation, patching, change management, and third-party software upgrades.

· Responsible for database hardening procedure identification and deployment on public cloud, hosted, and on-premises platforms.

· Responsible for providing database expertise and operations support to the technical support teams and project delivery teams.

· Responsible for participating in database platform reviews, bench and tuning exercises, and security evaluations, and for providing technical analysis and proactive recommendations for improvements and/or design changes for production platforms

Minimum Requirements: Skills, Experience & Education

· HS diploma with 8+ years of experience in Cassandra administration (NOT architecture or design)

· College degree in Computer Science preferred + 8-10 years’ experience

· NoSQL Database: 8-10 years of Cassandra administration

· Extensive background with public cloud database deployment, management and migration.

· Expertise in database concepts, defining standards, processes, and procedures in database deployment methodologies

· Expert in operations of high-profile production database platforms with high SLA and high-performance expectation

· High level of experience in managing change on production database platforms (hosted, on-premises, and cloud)

· Expert in deploying high availability database architectures

· Proactive, team player, and leadership qualities with strong technical background

· Excellent verbal and written communication skills

Preferred Qualifications

· Highly skilled in Cassandra database administration

· DataStax Enterprise Cassandra administration a plus

· Strong production operations and troubleshooting skills

· Linux operating system background

· Skilled in public cloud deployment methods/tools (GitLab, Terraform, Datadog)

· Knowledge of Kubernetes and Docker.

· Database performance evaluation and platform bench participation

Special Position Requirements:

Candidate will need to be able to multitask and quickly switch if needed to work on emergency incidents on production platforms. The position requires the ability to manage tight deadlines, maintain visibility on project delivery goals, and communicate effectively with project teams and management. The candidate should be able to thrive in a fast-paced work environment.

Looking for a candidate who is currently maintaining Cassandra clusters (avoid those who did so only in the past, or a couple of years ago)
How many clusters are maintained today?
How many nodes?
What Cassandra version are they running?
How many years have they worked on Cassandra (ideally 5+)?
Candidate has operations experience and can speak to challenges in their current environment
Manages patching / upgrades
Is called upon in a crisis to manage
Delivers new environments
Has performance tuning experience with Cassandra
Is familiar with backup and recovery
Is familiar with monitoring Cassandra (Prometheus or Datadog a plus)
Is the go-to for other teams on Cassandra database topics
Candidate is adaptable to working in a fast-paced environment; context switching is normal
Candidate is OK being in stressful/challenging situations
Outages
Crisis team
War room

Not Specified

Site Reliability Engineer II

🏢 Spectraforce Technologies

Salary not disclosed

Alpharetta, GA 3 days ago

Title: Site Reliability Engineer II

Location: Alpharetta, GA (3 days a week onsite)

Duration: 6 months

Job Description:

We are seeking a skilled Site Reliability Engineer to join our team and help build, maintain, and scale our cloud-native infrastructure. You will work closely with development and operations teams to ensure our systems are reliable, scalable, and efficient. The ideal candidate is passionate about automation, observability, and infrastructure-as-code, and thrives in a collaborative, fast-paced environment.

Key Responsibilities

Design, implement, and manage cloud infrastructure on Azure using Terraform and Terragrunt.
Maintain and optimize Kubernetes clusters on Azure Kubernetes Service (AKS).
Build and manage CI/CD pipelines using GitHub Actions/Workflows and ArgoCD for GitOps deployments.
Enhance system reliability by implementing monitoring, alerting, and observability solutions with Grafana.
Automate operational tasks to reduce toil and improve team efficiency.
Participate in on-call rotations, incident response, and post-mortem analysis.
Collaborate with development teams to improve application performance, scalability, and resilience.
Implement and advocate for SRE best practices, including SLIs, SLOs, and error budgets.
Continuously improve system performance, cost efficiency, and security.

Required Skills & Qualifications

3+ years of experience in an SRE, DevOps, or cloud infrastructure role.
Strong experience with Azure cloud services and infrastructure.
Hands-on experience with Java, Terraform, and Terragrunt for infrastructure-as-code.
Proficiency with Kubernetes (preferably AKS) and container orchestration.
Experience with CI/CD tools, especially GitHub Workflows/Actions and ArgoCD.
Solid understanding of observability tools like Grafana (Prometheus, Loki, Tempo experience is a plus).

Education Requirements: Bachelor's degree required (Master's preferred)

Not Specified

Staff Software Engineer, Observability

🏢 Pinterest

Salary not disclosed

San Francisco, CA 3 days ago

About Pinterest:

Millions of people around the world come to our platform to find creative ideas, dream about new possibilities and plan for memories that will last a lifetime. At Pinterest, we're on a mission to bring everyone the inspiration to create a life they love, and that starts with the people behind the product.

Discover a career where you ignite innovation for millions, transform passion into growth opportunities, celebrate each other's unique experiences and embrace the flexibility to do your best work. Creating a career you love? It's Possible.

At Pinterest, AI isn't just a feature, it's a powerful partner that augments our creativity and amplifies our impact, and we're looking for candidates who are excited to be a part of that. To get a complete picture of your experience and abilities, we'll explore your foundational skills and how you collaborate with AI.

Through our interview process, what matters most is that you can always explain your approach, showing us not just what you know, but how you think. You can read more about our AI interview philosophy and how we use AI in our recruiting process here.

We're seeking an exceptional Staff Software Engineer to join our Observability team at Pinterest. This role combines deep technical expertise in distributed systems and data engineering with a product-oriented mindset to build world-class observability solutions that empower our engineering organization. As a Staff Engineer on the Observability team, you'll be responsible for designing and building the infrastructure and tools that provide visibility into Pinterest's large-scale distributed systems, helping thousands of engineers understand, debug, and optimize their services.

What you'll do:

Define and execute the observability roadmap, treating it as a product. Understand engineering team needs and translate them into technical solutions with measurable impact.
Architect, build, and scale distributed observability infrastructure (metrics, logs, traces) to handle massive volumes across Pinterest's distributed systems.
Build high-performance data pipelines and storage for real-time and historical telemetry analysis at Pinterest scale.
Champion Best Practices: Establish observability standards and patterns across the organization, making it easy for teams to instrument their services and gain actionable insights
Technical Leadership: Mentor engineers, lead architectural reviews, and influence technical decisions across teams to improve overall system reliability and performance
Cross-functional Collaboration: Partner with SRE, Infrastructure, Product Engineering, and other teams to understand pain points and deliver solutions that improve developer productivity and system reliability
Innovation: Stay current with observability trends and technologies, evaluating and adopting cutting-edge tools and techniques to keep Pinterest at the forefront

What we're looking for:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
Product Mindset: Demonstrated ability to work backwards from customer needs (understanding user needs, prioritizing features, measuring success, and iterating based on feedback). Experience building internal platforms or tools with strong adoption
Distributed Systems Expertise: 7+ years of experience designing and operating large-scale distributed systems with deep understanding of consistency, availability, scalability, and failure modes
Data Engineering Skills: Strong background in building data pipelines, working with time-series databases, columnar storage, stream processing (Kafka, Flink, etc.), and data modeling at scale
Observability Domain Knowledge: Hands-on experience with modern observability tools and practices including metrics, logging, tracing, and profiling. Familiarity with OpenTelemetry, Prometheus, Grafana, or similar technologies
Programming Proficiency: Expert-level coding skills in languages like Java, Python, Go, or Scala with ability to write production-quality code
Systems Thinking: Ability to see the big picture while managing complex technical details, balancing trade-offs between cost, performance, and reliability
Experience building observability platforms from the ground up or significantly scaling existing solutions
Familiarity with cloud-native architectures and technologies (Kubernetes, service mesh, etc.)
Track record of driving adoption of internal platforms through excellent documentation, UX, and developer advocacy
Experience with machine learning or anomaly detection applied to observability use cases
Strong communication skills with ability to influence stakeholders at all levels
Contributions to open-source observability projects are a plus

In-Office Requirement Statement:

We let the type of work you do guide the collaboration style. That means we're not always working in an office, but we continue to gather for key moments of collaboration and connection.
This role will need to be in the office for in-person collaboration 1-2 times/quarter and therefore can be situated anywhere in the country.

Relocation Statement:

This position is not eligible for relocation assistance. Visit our PinFlex page to learn more about our working model.

#LI-REMOTE

At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.

Information regarding the culture at Pinterest and benefits available for this position can be found here.

US-based applicants only: $177,185-$364,795 USD

Our Commitment to Inclusion:

Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, national origin, religion or religious creed, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, age, marital status, status as a protected veteran, physical or mental disability, medical condition, genetic information or characteristics (or those of a family member) or any other consideration made unlawful by applicable federal, state or local laws. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require a medical or religious accommodation during the job application process, please complete this form for support.

Not Specified

REO Resiliency Engineering and Quality Leader (Hybrid)

✦ New

🏢 Securian

Salary not disclosed

Saint Paul, MN, Hybrid 1 day ago

*At Securian Financial, the internal position title is Infrastructure Dir.

Mission

"To lead the engineering discipline that ensures Securian's technology platforms and cloud services are built and operated with uncompromising resilience, performance, and quality. This role drives the design and automation of fault-tolerant, high-availability architectures across AWS, Azure, and GCP, ensuring the enterprise meets resiliency, scalability, and efficiency expectations at every layer of technology."

Positioning

The Director of Resilience Engineering and Quality Leader is both a strategic peer and technical counterpart to the Infrastructure & Reliability Engineering Leader.

This role provides bench depth and succession coverage for REO's most technically complex domains while driving innovation in reliability, resilience, and performance practices.

Strategic influence: Shapes cloud reliability, quality engineering, and resilience strategy across REO and Architecture domains.
Operational authority: Leads Sr. Managers and Managers who own the execution of quality, resilience, and performance engineering capabilities.
Enterprise collaboration: Works hand-in-hand with Technology, Solution, Business, Data, and Enterprise Architects to embed reliability and resilience as core architecture principles.

Scope of Accountability

Resilience Engineering & Cloud Reliability

Architect and validate fault-tolerant, regionally resilient architectures across AWS, Azure, and GCP.
Own resilience automation, chaos testing, and IaC-based recovery validation.
Lead cross-cloud reliability design reviews and failure-mode analyses for critical systems.

Quality Engineering & Continuous Testing

Define enterprise-wide quality engineering strategy integrated into CI/CD pipelines.
Drive automation-first testing (functional, non-functional, performance, resilience).
Embed observability-driven quality validation and contract testing across services.

Performance, Capacity & Efficiency Engineering

Oversee predictive capacity planning, scaling automation, and cost/efficiency optimization (FinOps/GreenOps).
Partner with Platform & Infrastructure teams to tune performance across application and platform layers.
Measure and report on performance SLIs/SLAs aligned to REO's Reliability Metrics framework.

Cross-Domain Architecture Collaboration

Partner with Enterprise Architects to codify resilience and reliability standards in technology blueprints.
Collaborate with Technology & Solution Architects to design service reliability into delivery architectures.
Engage Data Architects for data resilience, replication, and pipeline reliability.
Work with Business Architects to align technical reliability goals with critical business outcomes.

Leadership & Talent Development

Lead a team of Sr. Managers and Managers, fostering a high-performance, hands-on engineering culture.
Build and mentor top-tier technical talent in cloud reliability, resilience, and quality automation.
Partner with HR and REO Enablement to develop succession plans and technical competency frameworks.

Core Technical Competencies

AWS (primary) - Multi-account design, HA architecture, region failover, resilience automation, Terraform/CDK/CloudFormation.
Azure & GCP (secondary) - Compute, networking, and reliability constructs; hybrid cloud design and failover integration.
Infrastructure as Code (IaC) - Deep proficiency in Terraform, policy-as-code (OPA/Conftest), drift detection, pipeline integration.
Reliability & Chaos Engineering - AWS Fault Injection Simulator, Gremlin, steady-state hypothesis design.
Observability & Quality Automation - OpenTelemetry, Prometheus, CloudWatch, K6, Gatling; CI/CD quality gates and dashboards.
Performance Engineering - Load, stress, and soak testing automation; performance profiling and SLO alignment.
Disaster Recovery Automation - Cross-region orchestration, IaC-driven DR runs, replication validation.
FinOps/GreenOps - Cloud cost and efficiency automation, carbon-aware scaling policies.

Leadership Competencies

Strategic Technical Leadership: Operates at the intersection of deep engineering and executive strategy.
Multi-Domain Collaborator: Integrates reliability and resilience across architecture, operations, and business domains.
Talent Multiplier: Develops and empowers senior managers, fostering engineering mastery and innovation.
Credible Technical Authority: Trusted peer to Infrastructure & Reliability Engineering; capable of leading architecture reviews and executive briefings.
Change Champion: Drives transformation of reliability practices across platforms, pipelines, and teams.

Qualifications & Experience

12+ years in cloud engineering, reliability, or platform leadership roles.
5+ years leading Sr. Managers/Managers in technical domains.
Proven expertise across AWS, with working knowledge of Azure and GCP.
Experience with multi-cloud governance, DR design, IaC at scale, and reliability automation.
Strong understanding of observability, SRE principles, and REO/ITIL-aligned reliability frameworks.
Certifications:
- Required: AWS Certified Solutions Architect - Professional
- Preferred: AWS DevOps Engineer, Azure Solutions Architect Expert, Google Professional Cloud Architect

Success Metrics

99.9% availability maintained for Tier-1 workloads.
100% coverage of DR automation for Tier-1 services.
25% annual increase in automated quality/test coverage.
15% annual improvement in resource efficiency and cost performance.
Documented resilience participation across all enterprise architecture blueprints.
Positive "technical peer readiness" and succession rating from Head of REO.

Summary Value Proposition

This Director role blends deep AWS reliability engineering expertise, multi-cloud technical breadth, and leadership scale.

It ensures REO maintains both technical depth and leadership redundancy, and it strengthens the bridge between engineering execution and enterprise architecture alignment.

#LI-hybrid **This position will be in a hybrid working arrangement.**

Securian Financial believes in hybrid work as an integral part of our culture. Associates get the benefit of working both virtually and in our offices. If you're in a commutable distance (90 minutes), you'll join us 3 days each week in our offices to collaborate and build relationships. Our policy allows flexibility for the reality of business and personal schedules.

The estimated base pay range for this job is:

$145,000.00 - $267,000.00

Pay may vary depending on job-related factors and individual experience, skills, knowledge, etc. More information on base pay and incentive pay (if applicable) can be discussed with a member of the Securian Financial Talent Acquisition team.

Be you. With us. At Securian Financial, we understand that attracting top talent means offering more than just a job - it means providing a rewarding and fulfilling career. As a valued member of our high-performing team, we want you to connect with your work, your relationships and your community. Enjoy our comprehensive range of benefits designed to enhance your professional growth, well-being and work-life balance, including the advantages listed here:

Paid time off:

We want you to take time off for what matters most to you. Our PTO program provides flexibility for associates to take meaningful time away from work to relax, recharge and spend time doing what's important to them. And Securian Financial rewards associates for their service by providing additional PTO the longer you stay at Securian.
Leave programs: Securian's flexible leave programs allow time off from work for parental leave, caregiver leave for family members, bereavement and military leave.
Holidays: Securian provides nine company paid holidays.

Company-funded pension plan and a 401(k) retirement plan: Share in the success of our company. Securian's 401(k) company contribution is tied to our performance up to 10 percent of eligible earnings, with a target of 5 percent. The amount is based on company results compared to goals related to earnings, sales and service.

Health insurance: From the first day of employment, associates and their eligible family members - including spouses, domestic partners and children - are eligible for medical, dental and vision coverage.

Volunteer time: We know the importance of community. Through company-sponsored events, volunteer paid time off, a dollar-for-dollar matching gift program and more, we encourage you to support organizations important to you.

Associate Resource Groups: Build connections, be yourself and develop meaningful relationships at work through associate-led ARGs. Dedicated groups focus on a variety of interests and affinities, including:

Mental Wellness and Disability
Pride at Securian Financial
Securian Young Professionals Network
Securian Multicultural Network
Securian Women and Allies Network
Servicemember Associate Resource Group

For more information regarding Securian's benefits, please review our Benefits page.

This information is not intended to explain all the provisions of coverage available under these plans. In all cases, the plan document dictates coverage and provisions.

Securian Financial Group, Inc. does not discriminate based on race, color, religion, national origin, sex, gender, gender identity, sexual orientation, age, marital or familial status, pregnancy, disability, genetic information, political affiliation, veteran status, status in regard to public assistance or any other protected status. If you are a job seeker with a disability and require an accommodation to apply for one of our jobs, please contact us by email at , by telephone (voice), or 711 (Relay/TTY).


Remote working/work at home options are available for this role.


W2 Role: Senior Site Reliability Engineer

✦ New

🏢 Yochana

Salary not disclosed

Charlotte, North Carolina 8 hours ago

Job Title: Senior Site Reliability Engineer

Location: Charlotte, NC / Columbus, OH – Hybrid (3 days onsite a week)

Duration: Contract role (W2)

In-person interview required in NJ or NC on Saturday, March 21

Job Description:

Tech Stack: Java/J2EE (Spring, Spring Boot), Python, Shell Scripting, Kafka, Oracle, MongoDB, etc.

10+ years of Software Engineering experience
5+ years of experience in Site Reliability Engineering teams with continued focus on improving Platform health
Familiar with Agile or other rapid application development practices
Hands-on expertise in building dashboards using APM tools.
Experience with distributed (multi-tiered) systems, algorithms, relational databases, and NoSQL databases.
Knowledge of and exposure to caching tools (Redis, Memcached) or messaging tools such as MQ and Kafka.
Must have working knowledge of APM tools such as Splunk, GCL, ELK, Grafana, Prometheus, etc.
Able to create dashboards using GCL/Splunk/ELK and set up alerts.
Working knowledge of CI/CD is a plus: source control such as Git, and continuous integration with Jenkins / UCD Release, etc.
Ability to work with engineering teams across the ecosystem, such as Security, Networking, and Infrastructure, on challenges that can impact platform health and resiliency.
Shell scripting and DevOps tools like Ansible, with good knowledge of YAML for writing playbooks.
Experience with distributed storage technologies like NFS, as well as dynamic resource management frameworks such as PCF, Kubernetes / OpenShift, AWS, or Azure.
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks


RedHat OpenShift & Kubernetes SME

✦ New

🏢 VDart

Salary not disclosed

Princeton, New Jersey 8 hours ago

Job Title: RedHat OpenShift & Kubernetes SME

Location: Princeton - NJ - 08540

Mode: Contract (6+ Months) – Onsite

Min 15 Years of experience required.

Qualifications:

Design, deploy, and maintain Red Hat OpenShift and Rancher-managed Kubernetes clusters

Architect highly available, scalable, and secure container platforms

Install, configure, upgrade, and patch OpenShift and Rancher clusters

Implement logging, monitoring, and alerting (Prometheus, Grafana, EFK, etc.)

Troubleshoot cluster, networking, storage, and application issues

Perform root cause analysis and provide performance optimization

Act as an SME for OpenShift and Rancher Technologies

Provide guidance to Customer and application teams

Create documentation, standards, and operational runbooks

Strong hands-on experience with Red Hat OpenShift and Rancher (RKE, RKE2)

Expert knowledge of Kubernetes architecture and operations

Experience supporting mixed OS environments (Windows and Linux).

Excellent communication skills, able to explain complex concepts to technical and non-technical audiences.

Demonstrated ability to work independently and as part of a team.

Relevant certifications (RHCA, CKA, CKAD, etc.) and active participation in the Kubernetes community are a plus.

Experience with CI/CD Pipelines


Agentic AI Engineer

✦ New

🏢 Unisys

Salary not disclosed

Rockville, Maryland 8 hours ago

Overview

Architects and builds the infrastructure and tooling that powers AI agent development across the Software Development Lifecycle (SDLC). Develops production-grade agentic systems, orchestration frameworks, and observability solutions that enable teams to build, deploy, and monitor reliable AI agents at scale. Plays a key role in defining and implementing the next generation of SDLC through AI-first innovation and comprehensive instrumentation.

What We're Looking For

You demonstrate sharp product sense for high-impact automation opportunities, technical taste in implementation decisions, and the ability to clearly articulate trade-offs. You know when to apply AI agent solutions versus simpler approaches and can explain the "why" behind architectural choices.

You excel at 0-to-1 (and 1-to-100) product development, comfortable operating in ambiguous environments where requirements emerge through experimentation and iteration rather than upfront specification.

Key Responsibilities

AI Agent Development & Automation:

• Develop production-grade AI agents that eliminate manual handoffs across the SDLC

• Create custom integrations and CLI tools that give agents deep understanding of internal systems and codebases

• Design comprehensive testing strategies to ensure agent reliability and output quality

• Implement "Golden Path" scaffolding that embeds organizational standards into new projects

• Build AI solutions that improve codebase navigation, documentation, and developer workflows

• Identify workflow bottlenecks and deliver measurable impact through intelligent automation

• Shape SDLC evolution by identifying AI-first opportunities and proving outcomes through experimentation

Agent Infrastructure & Platform:

• Architect and maintain production infrastructure supporting agent deployment, lifecycle management, and scaling

• Develop agent frameworks, templates, and SDKs that accelerate agent development

• Create a governed Model Context Protocol (MCP) catalog enabling compliant agent-to-agent and agent-to-MCP communication

• Implement governance controls for agent behavior, permissions, and system access

Observability & Performance Analytics:

• Design and implement metrics, monitoring, and logging infrastructure for AI agents and development workflows

• Build dashboards that provide actionable insights into developer productivity, tool adoption, and agent performance

• Establish KPIs and measurement frameworks to quantify the impact of AI-powered automation

• Create alerting and anomaly detection systems to ensure reliability of agents and tooling

• Analyze telemetry data to identify optimization opportunities and guide strategic investment decisions

Collaboration & Impact:

• Partner across teams to drive adoption of AI-powered tooling and process transformation

• Stay current with LLM technologies and coach colleagues on AI-assisted development and automation best practices

• Rapidly prototype solutions to validate use cases and prove value quickly

• Communicate data-driven insights to stakeholders through clear visualizations and reports

Preferred Qualifications:

• 5-7+ years of software engineering experience building production systems

• Proven experience building agentic systems using LLM orchestration frameworks

• Hands-on expertise with AI-powered development tools (code assistants, AI-enhanced editors)

• Strong foundation in SDLC, system design, and internal tooling development

• Experience with observability tools and practices including metrics collection, logging frameworks, and dashboard development

• Full-stack technical proficiency:

    • Languages: Java, Python, JavaScript/TypeScript

    • Frameworks: Angular, Spring Boot

    • CI/CD platforms and cloud infrastructure (AWS)

    • Monitoring/observability tools (e.g., Prometheus, Grafana, CloudWatch)

• Passion for transforming software development through AI innovation and data-driven decision making

