Prometheus Label Example Jobs in USA
1,609 positions found — Page 4
*At Securian Financial, the internal position title is Infrastructure Dir.*
Mission
"To lead the engineering discipline that ensures Securian's technology platforms and cloud services are built and operated with uncompromising resilience, performance, and quality. This role drives the design and automation of fault-tolerant, high-availability architectures across AWS, Azure, and GCP, ensuring the enterprise meets resiliency, scalability, and efficiency expectations at every layer of technology."
Positioning
The Director of Resilience Engineering and Quality Leader is both a strategic peer and technical counterpart to the Infrastructure & Reliability Engineering Leader.
This role provides bench depth and succession coverage for REO's most technically complex domains while driving innovation in reliability, resilience, and performance practices.
Strategic influence: Shapes cloud reliability, quality engineering, and resilience strategy across REO and Architecture domains.
Operational authority: Leads Sr. Managers and Managers who own the execution of quality, resilience, and performance engineering capabilities.
Enterprise collaboration: Works hand-in-hand with Technology, Solution, Business, Data, and Enterprise Architects to embed reliability and resilience as core architecture principles.
Scope of Accountability
Resilience Engineering & Cloud Reliability
Architect and validate fault-tolerant, regionally resilient architectures across AWS, Azure, and GCP.
Own resilience automation, chaos testing, and IaC-based recovery validation.
Lead cross-cloud reliability design reviews and failure-mode analyses for critical systems.
Quality Engineering & Continuous Testing
Define enterprise-wide quality engineering strategy integrated into CI/CD pipelines.
Drive automation-first testing (functional, non-functional, performance, resilience).
Embed observability-driven quality validation and contract testing across services.
Performance, Capacity & Efficiency Engineering
Oversee predictive capacity planning, scaling automation, and cost/efficiency optimization (FinOps/GreenOps).
Partner with Platform & Infrastructure teams to tune performance across application and platform layers.
Measure and report on performance SLIs/SLAs aligned to REO's Reliability Metrics framework.
Cross-Domain Architecture Collaboration
Partner with Enterprise Architects to codify resilience and reliability standards in technology blueprints.
Collaborate with Technology & Solution Architects to design service reliability into delivery architectures.
Engage Data Architects for data resilience, replication, and pipeline reliability.
Work with Business Architects to align technical reliability goals with critical business outcomes.
Leadership & Talent Development
Lead a team of Sr. Managers and Managers, fostering a high-performance, hands-on engineering culture.
Build and mentor top-tier technical talent in cloud reliability, resilience, and quality automation.
Partner with HR and REO Enablement to develop succession plans and technical competency frameworks.
Core Technical Competencies
AWS (primary) - Multi-account design, HA architecture, region failover, resilience automation, Terraform/CDK/CloudFormation.
Azure & GCP (secondary) - Compute, networking, and reliability constructs; hybrid cloud design and failover integration.
Infrastructure as Code (IaC) - Deep proficiency in Terraform, policy-as-code (OPA/Conftest), drift detection, pipeline integration.
Reliability & Chaos Engineering - AWS Fault Injection Simulator, Gremlin, steady-state hypothesis design.
Observability & Quality Automation - OpenTelemetry, Prometheus, CloudWatch, K6, Gatling; CI/CD quality gates and dashboards.
Performance Engineering - Load, stress, and soak testing automation; performance profiling and SLO alignment.
Disaster Recovery Automation - Cross-region orchestration, IaC-driven DR runs, replication validation.
FinOps/GreenOps - Cloud cost and efficiency automation, carbon-aware scaling policies.
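As a concrete (hypothetical) illustration of the Prometheus/SLO work these competencies describe, an alerting rule of this shape pages when a Tier-1 error ratio burns through a 99.9% availability target. Metric names, label values, and thresholds below are assumptions, not Securian's actual conventions.

```yaml
# Hypothetical Prometheus alerting rule for Tier-1 availability SLO burn.
# http_requests_total and the tier/code labels are placeholder names.
groups:
  - name: slo-availability
    rules:
      - alert: Tier1AvailabilityBudgetBurn
        # Error ratio over 1h against the 99.9% availability target (0.1% budget)
        expr: |
          sum(rate(http_requests_total{tier="tier1", code=~"5.."}[1h]))
            /
          sum(rate(http_requests_total{tier="tier1"}[1h])) > 0.001
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Tier-1 availability SLO burn (>0.1% errors over 1h)"
```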
Leadership Competencies
Strategic Technical Leadership: Operates at the intersection of deep engineering and executive strategy.
Multi-Domain Collaborator: Integrates reliability and resilience across architecture, operations, and business domains.
Talent Multiplier: Develops and empowers senior managers, fostering engineering mastery and innovation.
Credible Technical Authority: Trusted peer to Infrastructure & Reliability Engineering; capable of leading architecture reviews and executive briefings.
Change Champion: Drives transformation of reliability practices across platforms, pipelines, and teams.
Qualifications & Experience
12+ years in cloud engineering, reliability, or platform leadership roles.
5+ years leading Sr. Managers/Managers in technical domains.
Proven expertise across AWS, with working knowledge of Azure and GCP.
Experience with multi-cloud governance, DR design, IaC at scale, and reliability automation.
Strong understanding of observability, SRE principles, and REO/ITIL-aligned reliability frameworks.
Certifications:
Required: AWS Certified Solutions Architect - Professional
Preferred: AWS DevOps Engineer, Azure Solutions Architect Expert, Google Professional Cloud Architect
Success Metrics
99.9% availability maintained for Tier-1 workloads.
100% coverage of DR automation for Tier-1 services.
25% annual increase in automated quality/test coverage.
15% annual improvement in resource efficiency and cost performance.
Documented resilience participation across all enterprise architecture blueprints.
Positive "technical peer readiness" and succession rating from Head of REO.
Summary Value Proposition
This Director role blends deep AWS reliability engineering expertise, multi-cloud technical breadth, and leadership scale.
It ensures REO maintains both technical depth and leadership redundancy, and it strengthens the bridge between engineering execution and enterprise architecture alignment.
**This position will be in a hybrid working arrangement.**
Securian Financial believes in hybrid work as an integral part of our culture. Associates get the benefit of working both virtually and in our offices. If you're in a commutable distance (90 minutes), you'll join us 3 days each week in our offices to collaborate and build relationships. Our policy allows flexibility for the reality of business and personal schedules.
The estimated base pay range for this job is:
$145,000.00 - $267,000.00. Pay may vary depending on job-related factors and individual experience, skills, knowledge, etc. More information on base pay and incentive pay (if applicable) can be discussed with a member of the Securian Financial Talent Acquisition team.
Be you. With us. At Securian Financial, we understand that attracting top talent means offering more than just a job - it means providing a rewarding and fulfilling career. As a valued member of our high-performing team, we want you to connect with your work, your relationships and your community. Enjoy our comprehensive range of benefits designed to enhance your professional growth, well-being and work-life balance, including the advantages listed here:
Paid time off:
We want you to take time off for what matters most to you. Our PTO program provides flexibility for associates to take meaningful time away from work to relax, recharge and spend time doing what's important to them. And Securian Financial rewards associates for their service by providing additional PTO the longer you stay at Securian.
Leave programs: Securian's flexible leave programs allow time off from work for parental leave, caregiver leave for family members, bereavement and military leave.
Holidays: Securian provides nine company paid holidays.
Company-funded pension plan and a 401(k) retirement plan: Share in the success of our company. Securian's 401(k) company contribution is tied to our performance up to 10 percent of eligible earnings, with a target of 5 percent. The amount is based on company results compared to goals related to earnings, sales and service.
Health insurance: From the first day of employment, associates and their eligible family members - including spouses, domestic partners and children - are eligible for medical, dental and vision coverage.
Volunteer time: We know the importance of community. Through company-sponsored events, volunteer paid time off, a dollar-for-dollar matching gift program and more, we encourage you to support organizations important to you.
Associate Resource Groups: Build connections, be yourself and develop meaningful relationships at work through associate-led ARGs. Dedicated groups focus on a variety of interests and affinities, including:
Mental Wellness and Disability
Pride at Securian Financial
Securian Young Professionals Network
Securian Multicultural Network
Securian Women and Allies Network
Servicemember Associate Resource Group
For more information regarding Securian's benefits, please review our Benefits page.
This information is not intended to explain all the provisions of coverage available under these plans. In all cases, the plan document dictates coverage and provisions.
Securian Financial Group, Inc. does not discriminate based on race, color, religion, national origin, sex, gender, gender identity, sexual orientation, age, marital or familial status, pregnancy, disability, genetic information, political affiliation, veteran status, status in regard to public assistance or any other protected status. If you are a job seeker with a disability and require an accommodation to apply for one of our jobs, please contact us by email at , by telephone (voice), or 711 (Relay/TTY).
To view our privacy statement click here
To view our legal statement click here
Remote working/work at home options are available for this role.
Job Title : Senior Site Reliability Engineer
Location : Charlotte, NC/ Columbus, OH – Hybrid (3 days onsite a week)
Duration : Contract role (W2)
In-person interview required in NJ or NC on Saturday, March 21st
Job Description:
Tech Stack: Java/J2EE (Spring, Spring Boot), Python, shell scripting, Kafka, Oracle, MongoDB, etc.
- 10+ years of Software Engineering experience
- 5+ years of experience in Site Reliability Engineering teams with continued focus on improving Platform health
- Familiar with Agile or other rapid application development practices
- Hands-on expertise in building dashboards using APM tools.
- Experience with distributed (multi-tiered) systems, algorithms, relational databases, and NoSQL databases.
- Knowledge of and exposure to caching tools (Redis, Memcached) and messaging tools such as MQ or Kafka.
- Must have working knowledge of APM tools such as Splunk, GCL, ELK, Grafana, Prometheus, etc.
- Able to create dashboards using GCL/Splunk/ELK and set up alerts.
- Working knowledge of CI/CD is a plus: source control (e.g., Git) and continuous integration (e.g., Jenkins, UCD release).
- Ability to work with engineering teams across the ecosystem (Security, Networking, Infrastructure) on challenges that can impact platform health and resiliency.
- Shell scripting and DevOps tools such as Ansible, with good knowledge of YAML for writing playbooks.
- Experience with distributed storage technologies such as NFS, as well as dynamic resource management frameworks (PCF, Kubernetes/OpenShift, AWS, or Azure).
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
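The Ansible/YAML requirement above could look like this in practice. A minimal playbook sketch; the host group, package, and service names are placeholders, not the client's actual inventory.

```yaml
# Hypothetical Ansible playbook: install and start a Prometheus node exporter
# on a group of application servers. Names below are illustrative.
- name: Ensure node exporter is installed and running
  hosts: app_servers
  become: true
  tasks:
    - name: Install prometheus node exporter
      ansible.builtin.package:
        name: prometheus-node-exporter
        state: present

    - name: Ensure the exporter service is enabled and started
      ansible.builtin.service:
        name: prometheus-node-exporter
        state: started
        enabled: true
```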
Job Title: RedHat OpenShift & Kubernetes SME
Location: Princeton - NJ - 08540
Mode : Contract (6+ Months) – Onsite
Minimum of 15 years of experience required.
Qualifications:
Design, deploy and maintain Red Hat OpenShift and Rancher Managed Kubernetes Clusters
Architect Highly available, scalable, and secure container platforms
Install, configure, upgrade and patch OpenShift and Rancher Clusters
Implement logging, monitoring, and alerting (Prometheus, Grafana, EFK etc.)
Troubleshoot Cluster, Networking, Storage, and application issues
Perform root cause analysis and provide performance optimization
Act as an SME for OpenShift and Rancher Technologies
Provide guidance to Customer and application teams
Create documentation, standards, and operational runbooks
Strong hands-on experience with Red Hat OpenShift and Rancher (RKE, RKE2)
Expert knowledge of Kubernetes architecture and Operations
Experience supporting mixed OS environments (Windows and Linux).
Excellent communication skills, able to explain complex concepts to technical and non-technical audiences.
Demonstrated ability to work independently and as part of a team.
Relevant certifications (RHCA, CKA, CKAD, etc.) and active participation in the Kubernetes community are a plus.
Experience with CI/CD Pipelines
Overview
Architects and builds the infrastructure and tooling that powers AI agent development across the Software Development Lifecycle (SDLC). Develops production-grade agentic systems, orchestration frameworks, and observability solutions that enable teams to build, deploy, and monitor reliable AI agents at scale. Plays a key role in defining and implementing the next generation of SDLC through AI-first innovation and comprehensive instrumentation.
What We're Looking For
You demonstrate sharp product sense for high-impact automation opportunities, technical taste in implementation decisions, and the ability to clearly articulate trade-offs. You know when to apply AI agent solutions versus simpler approaches and can explain the "why" behind architectural choices.
You excel at 0-to-1 (and 1-to-100) product development, comfortable operating in ambiguous environments where requirements emerge through experimentation and iteration rather than upfront specification.
Key Responsibilities
AI Agent Development & Automation:
• Develop production-grade AI agents that eliminate manual handoffs across the SDLC
• Create custom integrations and CLI tools that give agents deep understanding of internal systems and codebases
• Design comprehensive testing strategies to ensure agent reliability and output quality
• Implement "Golden Path" scaffolding that embeds organizational standards into new projects
• Build AI solutions that improve codebase navigation, documentation, and developer workflows
• Identify workflow bottlenecks and deliver measurable impact through intelligent automation
• Shape SDLC evolution by identifying AI-first opportunities and proving outcomes through experimentation
Agent Infrastructure & Platform:
• Architect and maintain production infrastructure supporting agent deployment, lifecycle management, and scaling
• Develop agent frameworks, templates, and SDKs that accelerate agent development
• Create governed Model Context Protocol (MCP) catalog enabling compliant agent-to-agent and agent-to-MCP communication
• Implement governance controls for agent behavior, permissions, and system access
Observability & Performance Analytics:
• Design and implement metrics, monitoring, and logging infrastructure for AI agents and development workflows
• Build dashboards that provide actionable insights into developer productivity, tool adoption, and agent performance
• Establish KPIs and measurement frameworks to quantify the impact of AI-powered automation
• Create alerting and anomaly detection systems to ensure reliability of agents and tooling
• Analyze telemetry data to identify optimization opportunities and guide strategic investment decisions
Collaboration & Impact:
• Partner across teams to drive adoption of AI-powered tooling and process transformation
• Stay current with LLM technologies and coach colleagues on AI-assisted development and automation best practices
• Rapidly prototype solutions to validate use cases and prove value quickly
• Communicate data-driven insights to stakeholders through clear visualizations and reports
Preferred Qualifications:
• 5-7+ years of software engineering experience building production systems
• Proven experience building agentic systems using LLM orchestration frameworks
• Hands-on expertise with AI-powered development tools (code assistants, AI-enhanced editors)
• Strong foundation in SDLC, system design, and internal tooling development
• Experience with observability tools and practices including metrics collection, logging frameworks, and dashboard development
• Full-stack technical proficiency:
• Languages: Java, Python, JavaScript/TypeScript
• Frameworks: Angular, Spring Boot
• CI/CD platforms and cloud infrastructure (AWS)
• Monitoring/observability tools (e.g., Prometheus, Grafana, CloudWatch)
• Passion for transforming software development through AI innovation and data-driven decision making
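The metrics infrastructure described above centers on Prometheus-style labeled time series, where each distinct label set keys its own series. As a minimal sketch of that idea, here is a hand-rolled labeled counter that renders Prometheus text exposition format; a real implementation would use the official prometheus_client library, and the metric and label names are illustrative.

```python
# Pure-Python sketch of a Prometheus-style labeled counter. Each unique
# combination of label values keys a separate time series, which is what
# makes per-agent / per-status dashboards possible.
from collections import defaultdict

class LabeledCounter:
    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self._values = defaultdict(float)  # keyed by tuple of label values

    def inc(self, amount=1.0, **labels):
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def expose(self):
        """Render the counter in Prometheus text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{label_str}}} {value}")
        return "\n".join(lines)

# Hypothetical agent telemetry: two successful code-review runs, one docs failure.
agent_runs = LabeledCounter("agent_runs_total", "Agent executions.", ["agent", "status"])
agent_runs.inc(agent="code-review", status="success")
agent_runs.inc(agent="code-review", status="success")
agent_runs.inc(agent="docs", status="error")
print(agent_runs.expose())
```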
Position Summary
Our client is building a modern, cloud-native platform that powers connected, data-driven manufacturing operations. Their technology sits at the center of increasingly automated factories, integrating equipment, software systems, and real-time production data into a scalable SaaS platform used by global manufacturers.
To support rapid growth and platform scale, they are seeking a Senior Cloud Operations Engineer to own the reliability, performance, and operational excellence of their cloud infrastructure. This is a highly impactful role responsible for ensuring the platform remains highly available, secure, and scalable as adoption continues to grow.
This position is ideal for engineers who thrive in modern cloud environments, enjoy solving complex reliability challenges, and prefer automating everything possible. The right person will combine deep technical expertise with strong operational discipline, helping build a world-class cloud platform supporting real industrial environments.
Key Responsibilities
Cloud Operations & Reliability
• Maintain and optimize production, staging, and development environments running in Kubernetes on AWS
• Implement and manage monitoring, logging, alerting, and observability frameworks
• Lead incident response efforts and drive post-incident reviews focused on continuous improvement
• Own backup, disaster recovery, and business continuity processes
• Perform system capacity planning and performance tuning
Automation & Infrastructure Management
• Build and maintain Infrastructure-as-Code using tools such as Terraform or Pulumi
• Automate provisioning, configuration management, and environment lifecycle processes
• Identify and eliminate operational inefficiencies through automation
• Manage secrets, environment configuration, and version control across infrastructure environments
Security & Compliance
• Implement and maintain least-privilege access models and cloud security guardrails
• Support vulnerability management, patching workflows, and dependency maintenance
• Assist with compliance readiness efforts including SOC 2, ISO 27001, or similar frameworks
• Ensure proper logging, retention, and audit practices across cloud environments
FinOps / Cost Optimization
• Monitor and optimize cloud spend across services and environments
• Implement tagging standards, budget alerts, and cost visibility frameworks
• Recommend architectural improvements to balance performance and cost efficiency
Collaboration & Leadership
• Partner closely with engineering teams to improve reliability, deployment pipelines, and system architecture
• Mentor engineers on operational best practices and cloud platform management
• Develop runbooks, documentation, and operational standards
• Champion reliability engineering principles, operational maturity, and risk reduction practices
Technical Environment
Candidates should be comfortable working in modern cloud-native environments and familiar with:
• Kubernetes clusters, autoscaling, Helm charts, and service mesh concepts
• AWS cloud services including compute, networking, storage, and cost management
• Infrastructure-as-Code frameworks such as Terraform
• Observability platforms such as Datadog, CloudWatch, Prometheus, or New Relic
• CI/CD tools such as GitHub Actions, Bitbucket Pipelines, or Bamboo
• Linux systems administration and troubleshooting
• SRE practices including SLIs, SLOs, MTTR, RTO/RPO, and incident management
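The SLI/SLO and error-budget vocabulary in the last bullet reduces to simple arithmetic. A sketch, with the 30-day window and sample numbers as illustrative assumptions:

```python
# Error-budget arithmetic behind availability SLOs. A 99.9% SLO over a
# 30-day window allows roughly 43.2 minutes of downtime.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for a given availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes allowed
print(round(budget_remaining(0.999, 20.0), 2))  # 0.54 of the budget left after 20 min down
```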
Must have
Teradata platform expertise
• Deep knowledge of Teradata architecture: parsing, BYNET, AMPs, vprocs, fallback, hashing, PDCR, and spool management.
• Data distribution and primary index design; collecting statistics and understanding optimizer behavior.
• Experience with recent Teradata versions and with migration/upgrade planning: TD 16.xx, TD 17.xx, and preferably TD 20.xx.
System administration
• Provisioning and managing Teradata nodes and clusters (physical and virtual).
• OS-level skills: Linux administration (SLES/RHEL/CentOS/Oracle Linux) for Teradata on Linux, including kernel tuning, package management, user and permissions management.
• Storage subsystem knowledge: SAN, NAS, Fibre Channel, LUNs, RAID, and how storage impacts Teradata I/O and spool.
Performance tuning and troubleshooting
• SQL query and plan analysis; collecting and interpreting Explain plans.
• Workload management (WLM) and resource allocation: query prioritization, throttling, and KRI/SLAs.
• Monitoring and diagnostics: using Teradata tools and logs to analyze spool, CPU, memory, disk I/O, network, BYNET contention.
Backup, recovery & high availability
• Best practices for backup and restore procedures, and disaster recovery (DR) planning and testing.
• Knowledge of fallback, AMP resilience, replication methods and physical vs logical protection.
Security & compliance
• DB and platform-level security: roles, privileges, LDAP/Kerberos integration, encryption (at rest/in transit), auditing, and compliance (SOX and others as applicable).
• Secure configuration and hardening practices.
Networking & infrastructure
• Network architecture for Teradata clusters, VLANs, link aggregation, low-latency requirements, and BYNET tuning.
• Integration with enterprise infrastructure: DNS, NTP, monitoring stacks, and identity providers.
Automation, scripting & tools
• Scripting languages (at least one of Bash, Python, or Perl) for automation, maintenance, and custom monitoring.
• Configuration management and automation tools (at least one of Ansible, Terraform, Chef, or Puppet, as used in the enterprise).
• Familiarity with Teradata utilities and tools (at least one of BTEQ, FastLoad, MultiLoad, TPT (Teradata Parallel Transporter), DBSControl, Viewpoint, Teradata Studio/SQL Assistant).
Observability & tooling
• Use of monitoring/alerting tools (Viewpoint, Prometheus, Grafana, Splunk, Nagios, etc.) and designing dashboards and alerts; experience with at least one is required, and Viewpoint is mandatory.
• Capacity planning, trending, and forecasting for CPU, disk, spool, and concurrency.
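The capacity planning and forecasting bullet can be sketched as a linear trend fit over usage samples, projecting when usage crosses a threshold. The weekly data and 90% threshold below are illustrative assumptions, not real Teradata telemetry.

```python
# Capacity-trending sketch: ordinary least-squares fit over weekly
# disk-usage samples, then project the week the 90% threshold is hit.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

weeks = [0, 1, 2, 3, 4]
disk_pct = [60.0, 62.5, 65.0, 67.5, 70.0]   # % of disk/spool used each week (sample data)

slope, intercept = linear_fit(weeks, disk_pct)
threshold = 90.0
weeks_to_threshold = (threshold - intercept) / slope
print(f"growth: {slope:.1f} %/week, hits {threshold:.0f}% at week {weeks_to_threshold:.0f}")
```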
Soft skills & organizational capabilities
• Incident management and on-call experience
• Leading postmortems, RCA (root-cause analysis), implementing corrective actions.
• Communication and stakeholder management: vendors, management, and application teams.
• Translate technical impacts to business stakeholders; coordinate with DBAs, developers, network/storage teams, and vendors.
Role and Responsibilities
Installs, configures and upgrades Teradata software and related products.
• Backup, restore, migrate Teradata data and objects
• Establish and maintain backup and recovery policies and procedures.
• Manages and monitors system performance; proactively monitors the database systems to ensure secure service with minimum downtime.
• Implements and maintains database security.
• Sets up and maintains documentation and standards.
• Supports multiple Teradata Systems including independent marts/ enterprise warehouse.
• Work with the team to ensure that adequate hardware resources are allocated to the databases and to ensure high availability and optimum performance.
• Responsible for improvement and maintenance of the databases to include rollout and upgrades.
• Responsible for implementation and release of database changes as submitted by the development team, working with the end customer.
• Teradata, customer, datacenter, vendor co-ordinations
• Forecast data, security audits
• User account and access management
• Teradata active system management and customer requests and system allocation
• Backup and recovery
• SOX compliance and audits
• DB support from 3rd party vendors
• Product evaluations
• On call support and major incidents
• Backup restore, frequency and retention
• Disaster recovery
• Create long r
Weeks Group, LLC is a leading construction firm specializing in the development of advanced data center facilities. With a strong commitment to innovation, quality, and client satisfaction, we deliver cutting-edge solutions that address the dynamic needs of the data center industry. As we continue to expand, we are seeking a skilled and experienced Data Center Construction QAQC Manager to join our dream team. We are not headhunters. We don't just put butts in seats. We are a dream team of experts in the industry who thrive on solving problems and getting things done!
Weeks Group's Values:
We Answer the Call
Integrity- Honesty-Trust- Nimbleness
We Don’t Take No for an Answer
Persistence- Determination- Accountable
We Solve Problems
We Work Hard and Reward Well
Within Challenging, Intense Projects
We Expect the Best from Each Other
Teamwork- Communication
We BTFM
Innovative- Disdain for Mediocrity
If you don't have data center experience or don't align with our values, no need to apply.
Employment Type: Full-time-Traveling position option
Project Type: Hyperscale / Mission Critical Data Centers – Brownfield (live campus / retrofit / expansion)
Reports To: Project Director / Director of Construction Operations
Role Summary
We’re hiring an On-Site QA/QC Manager to lead the quality program on brownfield hyperscale data center construction—where safety, uptime, and precision matter as much as speed. You’ll own electrical QA/QC planning and execution, drive rigorous documentation, and ensure installations meet strict client standards, contract requirements, and code while working in/around live critical environments. This role supports readiness for energization, commissioning, and IST with strong change control and zero-surprise turnover.
What You’ll Do
- Own and maintain the Project Quality Plan (PQP) tailored for brownfield constraints (phasing, outages, access controls, change control).
- Build and manage electrical Inspection & Test Plans (ITPs), checklists, and hold/witness points—by system, room, and phase.
- Lead daily QA/QC field execution and verification against IFC drawings, approved submittals, vendor IOMs, RFIs, and method statements.
- Drive quality for the electrical critical path, including (as applicable):
- MV/LV distribution: switchgear, transformers, breakers, relays, terminations
- UPS/battery systems: installation verification, clearances, labeling, startup readiness
- Generators/paralleling gear: interface readiness, documentation capture, punch closure
- Busway/PDUs/RPPs: supports, alignment, tap boxes, labeling, grounding/bonding
- Cable tray/conduit: routing, supports, firestopping, separation, workmanship standards
- Grounding & bonding: integrity verification and as-built accuracy
- Controls/EPMS/BMS electrical interfaces: device placement, labeling, point-to-point readiness (as assigned)
- Enforce brownfield-specific quality disciplines:
- Verify phasing plans and temporary power installs meet requirements
- Maintain as-built accuracy in real time due to live site impacts and field changes
- Coordinate quality gates tied to shutdown windows, cutovers, and turnover milestones
- Manage deficiency systems: NCRs, punch lists, rework prevention, corrective/preventive action (CAPA), re-inspections, and verified closeout.
- Partner tightly with Operations, Controls, Commissioning, and Safety to ensure quality supports uptime protection and controlled energization.
- Own electrical turnover packages: inspection reports, test results, redlines/as-builts, O&Ms, training logs, vendor startup documentation, commissioning support documentation.
- Provide weekly reporting: trends, repeat issues, risk register inputs, and 2–6 week quality look-ahead tied to phasing and outage schedules.
Qualifications
- 7+ years QA/QC experience on mission critical construction with strong electrical focus; brownfield/live-site experience strongly preferred.
- Proven success running PQP/ITP programs, NCR/punch systems, and turnover documentation on fast-track or phased retrofits.
- Strong ability to interpret one-lines, schematics, control wiring diagrams, specs, and vendor documentation.
- Working knowledge of NEC/NFPA 70 and typical hyperscale QA requirements (labeling standards, documentation rigor, readiness gates).
- Highly organized, strong communicator, and able to coordinate across multiple trades, vendors, and stakeholders in a controlled environment.
Preferred
- Experience supporting cutovers, shutdown windows, energization planning, commissioning readiness, and IST
- Familiarity with NFPA 70E-related interfaces and verification of torque/labeling/test documentation programs
- Certifications: CQM-C, ASQ (CQA/CQE), OSHA 30
- Tools: Procore, ACC/BIM 360, Bluebeam, PlanGrid
What Success Looks Like
- Zero “surprise” quality issues during shutdown windows and cutovers
- Electrical systems pass startup/commissioning on first attempt
- NCR/punch stays controlled and closes quickly ahead of milestones
- Turnover packages are complete, accurate, and accepted without rework
Benefits
- Competitive compensation + bonus potential
- Health/dental/vision, 401(k), PTO
- Per diem/vehicle allowance (if applicable)
- Growth path within hyperscale mission critical delivery
Senior Software Engineer – Deployment & Reliability (Digital Pathology / Medical Imaging)
A fast-growing technology company operating in the digital pathology and medical imaging space is seeking a Senior Software Engineer to support the deployment, configuration, and long-term reliability of advanced imaging and AI-driven software systems.
This role sits at the intersection of software deployment, infrastructure engineering, and site reliability, ensuring complex software platforms are successfully installed, integrated with customer IT environments, and maintained at high levels of performance and stability.
You will work closely with engineering, customer support, and monitoring teams to ensure a smooth transition from system deployment to ongoing operational support while contributing to improvements that make deployments more scalable and reliable over time.
Key Responsibilities
Deployment & Configuration
- Lead end-to-end deployments of imaging, AI, and data management software systems at customer environments
- Configure and integrate servers, clusters, and storage systems within hospital or laboratory IT infrastructures
- Work with networking, authentication, storage, and security configurations to ensure successful installations
- Collaborate with field engineering teams during system installation and commissioning
- Develop standardized deployment playbooks, documentation, and validation checklists
System Reliability & Upgrades
- Manage software version rollouts, upgrades, and patching across deployed customer environments
- Work with monitoring and observability teams to track system performance and health
- Troubleshoot complex issues across multi-component systems including imaging software, AI inference pipelines, and storage layers
- Improve automation around upgrades, rollbacks, and maintenance processes
Engineering Collaboration & Continuous Improvement
- Identify recurring deployment or performance challenges and work with R&D teams to design long-term solutions
- Provide structured feedback from field deployments to improve product architecture and deployment workflows
- Validate new deployment tools, frameworks, and configuration approaches prior to wider rollout
- Contribute to improving the scalability and resilience of the overall platform
Customer IT & Cross-Functional Collaboration
- Serve as a technical liaison with customer IT teams regarding networking, infrastructure, security, and data access
- Ensure deployments comply with institutional IT policies and healthcare regulatory requirements
- Collaborate closely with support and monitoring teams to align escalation processes and root cause investigations
- Participate in post-deployment reviews to improve operational processes and reliability
Documentation & Knowledge Sharing
- Maintain detailed installation and configuration documentation
- Develop deployment guides, troubleshooting documentation, and internal knowledge resources
- Support and mentor field teams on standardized deployment and configuration practices
Requirements
- Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or related discipline
- 5+ years of experience in software deployment, DevOps, infrastructure engineering, or systems engineering
- Strong Linux (Ubuntu) administration and scripting skills
- Experience with containerization and orchestration technologies (Docker, Kubernetes)
- Experience with database technologies such as PostgreSQL or MongoDB
- Familiarity with web service configuration (Nginx or Apache)
- Solid understanding of networking concepts including VPNs, firewalls, and authentication systems
- Ability to troubleshoot complex distributed systems across software, infrastructure, and data layers
- Strong communication and collaboration skills when working with cross-functional teams and customer IT stakeholders
Preferred Experience
- Exposure to medical imaging systems, digital pathology, or healthcare technology environments
- Familiarity with DICOM or PACS systems
- Experience deploying or supporting AI/ML models in production environments
- Experience with observability and monitoring tools (Prometheus, Grafana, ELK)
- Knowledge of regulated environments and healthcare compliance frameworks (HIPAA, GDPR, IVDR)
- Experience supporting hardware and software integrated systems
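For context on the observability tooling listed above, here is a minimal sketch of how a labeled sample is rendered in the Prometheus text exposition format. The metric and label names are purely illustrative, not part of any actual product stack:

```python
def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format,
    e.g. deploy_status{component="scanner",site="lab_a"} 1
    Labels are sorted for a stable, comparable output."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"
```

In practice a client library such as prometheus_client handles this formatting (plus escaping and metric metadata); the sketch only shows the wire format a scrape endpoint exposes.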
Why This Role
This position offers the opportunity to work on advanced digital pathology and imaging technologies that support clinical diagnostics and research globally. The role combines hands-on technical deployment with the chance to influence how complex systems are designed, automated, and scaled across a growing global customer base.
About Us:
Astiva Health, Inc., located in Orange, CA is a premier health plan provider specializing in Medicare and HMO services. With a focus on delivering comprehensive care tailored to the needs of our diverse community, we prioritize accessibility, affordability, and quality in all aspects of our services. Join us in our mission to transform healthcare delivery and make a meaningful difference in the lives of our members.
SUMMARY:
We are seeking a skilled and adaptable Junior AI/ML Engineer to join our fast-moving team building impactful AI solutions in healthcare. Our work focuses on extracting and interpreting data from unstructured medical documents, improving clinical coding accuracy, streamlining administrative processes, and enhancing patient outreach.
Projects evolve rapidly, ranging from fine-tuning large language models (LLMs) on specialized medical PDFs to optimizing OCR pipelines in Azure, and new challenges emerge regularly. This role suits someone who thrives in ambiguity, enjoys hands-on model development, and wants to directly influence healthcare delivery through applied AI/ML.
ESSENTIAL DUTIES AND RESPONSIBILITIES include the following:
- Design, fine-tune, and optimize large language models (LLMs) and multimodal models for healthcare-specific NLP tasks, such as information extraction, classification, and summarization from clinical documents (e.g., medical charts, patient files, scanned forms).
- Develop and improve document understanding pipelines, including fine-tuning OCR / layout-aware models (especially in cloud environments like Azure AI, Azure Foundry) to handle real-world variability in medical forms, handwriting, and scanned PDFs.
- Build and iterate on end-to-end ML solutions that transform unstructured healthcare data into structured, actionable insights.
- Collaborate closely with clinicians, product managers, data annotators, and engineers to define problems, curate/annotate datasets, evaluate model performance against clinical and business metrics, and iterate quickly.
- Deploy models into production environments (cloud-based inference, batch processing, or API endpoints) with attention to latency, cost, scalability, and healthcare compliance considerations (HIPAA, data privacy).
- Stay current with advancements in LLMs, vision-language models, efficient fine-tuning techniques (LoRA/QLoRA, PEFT), RAG, multimodal AI, and domain-specific healthcare AI research.
- Contribute to a culture of rapid prototyping, rigorous evaluation, and continuous improvement in a dynamic project landscape where priorities can shift based on new opportunities or stakeholder needs.
- Other duties as assigned
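As a point of reference for the efficient fine-tuning techniques named above, the core idea behind LoRA can be sketched in a few lines: instead of updating a full weight matrix W, train two small low-rank matrices B (d x r) and A (r x k) and apply W + (alpha / r) * B @ A. This is a conceptual illustration with toy shapes, not production training code:

```python
def matmul(a, b):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_update(W, B, A, alpha, r):
    """Apply a LoRA-style low-rank update: W + (alpha / r) * (B @ A).
    Only B and A are trained; W stays frozen, which is what makes
    the technique parameter-efficient."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
```

Libraries such as Hugging Face PEFT implement this (and quantized variants like QLoRA) at scale; the sketch just shows why the trainable parameter count drops from d*k to r*(d + k).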
REQUIRED TECHNICAL SKILLS:
- Proficiency in Python and familiarity with common ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn)
- Experience applying NLP techniques to unstructured text
- Hands-on experience working with LLMs, including:
  - Prompt design and iteration
  - Using pre-trained models for classification or extraction tasks
- Foundational understanding of model fine-tuning, such as:
  - Fine-tuning transformer models or LLMs for classification or information extraction
  - Adapting existing training scripts or examples to new datasets
- Familiarity with model evaluation metrics (precision, recall, F1) and basic error analysis
- Experience working with labeled datasets and annotation outputs, including reviewing label quality
- Understanding of common ML problem types, including binary and multi-label classification
- Awareness of model bias, label noise, and false positives, with the ability to discuss tradeoffs and mitigation strategies
- Basic understanding of production ML workflows (versioning, reproducibility, monitoring concepts)
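The evaluation metrics named in the list above (precision, recall, F1) can be sketched from first principles for binary labels, which is the kind of foundational understanding the role asks for. A minimal illustration, assuming 1 marks the positive class:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive).
    Returns 0.0 for any metric whose denominator would be zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In day-to-day work one would typically use scikit-learn's metrics functions; the point of the sketch is the error-analysis intuition: precision penalizes false positives, recall penalizes false negatives.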
OTHER SKILLS and ABILITIES:
- Hands-on fine-tuning experience with LLMs (e.g., Hugging Face, OpenAI fine-tuning, Azure Foundry), even if limited to small-scale or academic projects
- Exposure to cloud ML platforms (Azure ML, AWS SageMaker, or GCP)
- Familiarity with RAG architectures and retrieval-based grounding
- Experience with NLP libraries (spaCy, Hugging Face Transformers, NLTK)
- Introductory experience with weak supervision or noisy-label learning
- Interest in healthcare or biomedical NLP
- Curiosity about knowledge graphs, ontologies, or structured prediction
- Familiarity with secure data handling practices
- Willingness and ability to learn workflows for sensitive or regulated data (e.g., HIPAA-covered healthcare data), including privacy-aware data handling and secure ML workflows
EXPERIENCE:
- Bachelor’s degree in a related field
- 1–2 years of experience in machine learning, applied NLP, or software engineering
- Demonstrated experience training or fine-tuning ML models, not just using APIs
- Ability to collaborate with senior engineers and domain experts and incorporate feedback
BENEFITS:
- 401(k)
- Dental Insurance
- Health Insurance
- Life Insurance
- Vision Insurance
- Paid Time Off
- Free catered lunches
We’re looking for a dynamic, hands-on sourcing professional who can help build and scale a best-in-class sourcing program supporting Private Label and New Product Development. You’ll partner closely with cross-functional leaders to identify the right suppliers, negotiate strong commercial agreements, and create repeatable sourcing processes that improve speed-to-market, cost, and supply continuity.
Summary:
The Sourcing Manager is an individual contributor responsible for leading end-to-end sourcing for Private Label and New Product Development. This role builds repeatable RFx and supplier selection processes, develops supplier partnerships, and translates cross-functional requirements into commercially sound recommendations and agreements. Success requires balancing cost, quality, risk, and speed to enable efficient, compliant product launches and a stronger supplier ecosystem.
Organizational Impact:
Reporting to the Senior Manager, Sourcing, this role will expand sourcing capability by creating scalable tools, templates, and governance that improve speed-to-market, supplier performance, and total cost outcomes. Your work will directly impact new product launch readiness, supply continuity, gross margin, and risk mitigation through strong supplier selection, commercial negotiations, and disciplined performance management.
What Success Looks Like (First 6–12 Months):
- Establish and socialize a clear sourcing intake and RFx process (templates, timeline, roles/RACI, evaluation criteria)
- Deliver on-time supplier selection and contracting for priority NPD/Private Label launches
- Build a qualified supplier pipeline (including international options where appropriate) across priority categories
- Implement basic supplier performance management (KPIs, scorecards, QBR cadence) for awarded suppliers
- Identify and deliver measurable value (TCO improvements, cost avoidance, risk reduction, lead-time and service improvements)
Key Deliverables:
- Standard RFx toolkit (RFI/RFP/RFQ templates, evaluation scorecards, award memo format)
- Supplier due diligence and onboarding checklist (quality, regulatory, capacity, financial, ESG as applicable)
- Negotiation playbook and contracting checklist (commercial terms, SLAs, lead times, payment terms)
- Supplier performance dashboard and QBR agenda
- Category/supplier landscape view for priority areas (options, risks, and recommendations)
Essential Duties and Responsibilities:
- Execute sourcing strategy for Private Label & New Product Development through day-to-day ownership of initiatives, insights, and recommendations
- Build and improve repeatable sourcing processes and governance across Marketing, Product, Quality/Regulatory, Operations, Finance, and Legal
- Lead complex sourcing initiatives end-to-end, managing stakeholders, timelines, and deliverables
- Develop category strategies (make/buy, supplier segmentation, dual sourcing, risk mitigation) informed by market intelligence and business needs
- Own end-to-end RFx events (RFI/RFP/RFQ): strategy, supplier engagement, evaluation, award, and transition to performance management
- Create standardized templates and scorecards that balance total value (price, lead time, quality, service, innovation, sustainability)
- Lead negotiations to optimize total cost of ownership (TCO) and value creation (rebates, payment terms, delivery, SLAs, IP considerations)
- Develop and manage a supplier network, building partnerships that deliver innovation, capacity, quality, and competitive advantage
- Drive supplier performance management (KPIs, dashboards, quarterly business reviews), continuous improvement, and corrective actions
- Conduct market intelligence to understand supply/demand dynamics, cost drivers, regulatory changes, and geopolitical risk
- Partner with Product, Engineering, and Quality to accelerate Private Label and NPD pipelines—from concept to commercialization
- Support proto sampling, validation, and scale-up activities in alignment with quality standards and regulatory requirements
- Ensure design-for-supply, manufacturability, and sustainability are embedded early in product development
- Lead cost modeling, scenario analysis, and benchmarking to inform awards and portfolio decisions
- Track performance to plan (savings, cost avoidance, working capital, resiliency), reporting outcomes and insights to leadership
- Additional job duties as assigned
Skills/Experience Required:
- Education: Bachelor’s degree in Business, Supply Chain, Engineering, or related field
- 5+ years’ experience in sourcing, procurement, and/or purchasing environments supporting product development and commercialization; medical device, medical/clinical, or other prior healthcare experience strongly desired
- Experience working with 3rd party contract manufacturers and/or direct manufacturing partners (medical devices or other healthcare solutions preferred)
- International sourcing experience preferred
- Experience with strategic sourcing and improving supplier performance
- Familiarity with contracting language and experience negotiating contracts with suppliers
- Understanding of manufacturing and quality validation processes and best practices preferred
- Strong knowledge of supply chain principles and processes
- Strength in negotiations, cost/price analysis, and purchasing procedures
- Knowledge of bids, RFx events (RFI/RFP/RFQ), and reverse auctions
- Understanding of new product launch and commercialization; experience in product development and manufacturing processes desired
- Excellent communication skills (written and verbal) with demonstrated ability to lead and influence at all levels, including senior stakeholders and business leaders
- Experience with project planning and project management; ability to lead cross-functional project teams
- Proven ability to work successfully in a deadline-driven environment with a sense of urgency
- Proficiency with Microsoft Office (advanced Excel and PowerPoint); experience with CRM and/or sourcing tools a plus
Sarnova is an Equal Opportunity Employer. We offer a competitive salary, commensurate with experience, along with a comprehensive benefits package, including 401(k) Plan. EEO/M/F/Veterans/Disabled. Our mission is to be the best partner for those who save and improve patients’ lives. Excellence in delivering upon our mission is dependent upon having a diverse team that is empowered to bring their full, authentic self to work each day. We strive to create a workplace that reflects the communities we serve, and we are passionate about creating an inclusive workplace that promotes and values diversity.