Technical Architect Jobs in California

Technical Architect jobs in California make up one of the most active markets in the country, with strong demand in enterprise technology, cloud infrastructure, financial services, and the defense and aerospace sectors. Roles range from associate level through principal and distinguished engineer. Hiring is concentrated in the San Francisco Bay Area, Los Angeles, and San Diego, where anchor employers such as Salesforce, Northrop Grumman, and Wells Fargo consistently maintain technical architect headcount. The specialties California employers request most often are cloud solution architecture, integration architecture, and security architecture. Find a role that fits below and apply directly.

Find Technical Architect Jobs

Overview

Open roles: 632+

Top employer: NVIDIA

Top city: San Francisco, CA

Cities hiring: 62

Work type: 70% On-site

Top industry: Technology

Showing 5 of 632+ Technical Architect jobs

Technical Architect

New 6h ago

Accellor

San Francisco, California

Software Engineering

Cloud & DevOps

Technical Product & Program Management

Data Science & Analytics

Salary TBD

10+ yrs exp.

On-Site

None

Technical Architect

New 6h ago

Accellor

Mountain View, California

Software Engineering

Cloud & DevOps

Data Science & Analytics

Technical Product & Program Management

Salary TBD

10+ yrs exp.

On-Site

None

Senior Technical Architect/Specifier

Added 5d ago

ECS Limited

Walnut Creek, California

Specialized Engineering

Construction Management

Engineering (Non-Software)

$98k - $136k/yr

8+ yrs exp.

On-Site

Bachelor's

1,001-5,000

Senior Technical Architect/Specifier

Added 5d ago

ECS Limited

Roseville, California

Specialized Engineering

Construction Management

Engineering (Non-Software)

$98k - $136k/yr

8+ yrs exp.

On-Site

Bachelor's

1,001-5,000

Senior Technical Architect/Specifier

Added 5d ago

ECS Limited

Irvine, California

Specialized Engineering

Construction Management

Engineering (Non-Software)

$98k - $136k/yr

8+ yrs exp.

On-Site

Bachelor's

1,001-5,000

New 6h ago

Technical Architect

Accellor · San Francisco, California

Job functions:

Software Engineering

Cloud & DevOps

Technical Product & Program Management

Data Science & Analytics

10+ yrs exp.

On-Site

None

Full Time

Job Description

INTRODUCTION

Accellor is an AI-native services firm purpose-built for the post-ChatGPT era. Free from legacy constraints, we focus on delivering measurable business outcomes through advanced AI, data, and engineering capabilities. Our mission is to operationalize AI at scale and unlock sustained enterprise value. Our offerings span AI solutions, data services, enterprise applications, and product engineering, tailored to industry-specific needs across healthcare, life sciences, telecom, retail, financial services, and technology. By leveraging design thinking and technology-agnostic architectures, we ensure faster time-to-value and seamless interoperability. With a proven track record of enabling Fortune 100 enterprises and global innovators, Accellor stands as a trusted partner for organizations seeking to harness the full potential of AI. Our vision is clear: to build intelligent, connected ecosystems that deliver measurable outcomes and redefine the future of enterprise transformation.

ROLE

Technical Architect — AI Systems & Platform Internals

Experience: 10-12 Years
Role Type: Technical Architect / Staff-Level Systems Architect

Accellor is looking for a Technical Architect — AI Systems, Inference & Platform Internals to help design, scale, and optimize the systems that power ChatGPT, OpenAI API, Codex, agentic systems, multimodal experiences, and internal research workloads. This role is focused on the internal AI systems stack, including inference runtime, model serving, GPU infrastructure, distributed systems, context engineering, cost optimization, evaluation gates, observability, release safety, and production reliability. The ideal candidate is a senior hands-on architect who can reason across the full AI platform — from GPU-level performance and distributed inference to product-scale reliability, model deployment, safety, and cost-efficient operations.

Key Responsibilities:

AI Systems Architecture
Design and evolve large-scale AI systems that support ChatGPT, OpenAI API, Codex, agentic workflows, multimodal models, and research workloads. Define architecture across inference runtime, model serving, request routing, batching, KV-cache handling, GPU scheduling, distributed execution, observability, release gates, and production rollout. Own technical trade-offs across latency, throughput, reliability, correctness, safety, scalability, cost, and infrastructure efficiency.

Inference Runtime & Model Serving
Architect high-throughput, low-latency inference systems across large-scale GPU clusters. Work across inference engines, serving layers, scheduling systems, caching, streaming, deployment pipelines, and runtime optimization. Partner with engineering teams to improve model-serving efficiency, tail latency, GPU utilization, memory efficiency, correctness under load, and cost per request. Guide architecture decisions involving PyTorch, JAX, Triton, vLLM-style serving, CUDA/Triton kernels, distributed inference, tensor parallelism, pipeline parallelism, model sharding, and long-context serving.

GPU, Kernel & Distributed Performance
Analyze and improve performance across GPU kernels, memory movement, collective communication, orchestration, and runtime scheduling. Guide engineering decisions involving CUDA, Triton, NCCL/RCCL, GPU profiling, memory pressure, compute utilization, tensor layouts, interconnect behavior, and distributed execution. Identify system-level bottlenecks across compute, memory, networking, scheduling, model execution, and data movement.

Context Engineering
Design and guide context engineering frameworks that determine what information should be passed to the model, how it should be structured, how much context should be used, and how context quality should be measured. Own architecture patterns for prompt structure, dynamic context assembly, retrieval-augmented generation, long-context management, conversation memory, tool context, agent state, multimodal context, source grounding, permission-aware retrieval, context compression, and context auditability. Ensure AI systems use the right context, from the right source, with the right permissions, at the right cost, and with measurable quality.

Cost Optimization Frameworks
Design and build cost optimization frameworks for large-scale LLM and GenAI workloads. Create architecture patterns that reduce unnecessary token usage, redundant retrieval, repeated model calls, inefficient inference paths, and avoidable infrastructure spend. Drive model routing, token budgeting, prompt compression, context pruning, semantic caching, response caching, batch inference, async execution, fallback strategies, and cost telemetry across AI workflows. Ensure cost optimization does not compromise quality, safety, grounding, reliability, or user experience.

Training & Research Infrastructure
Collaborate with research and training infrastructure teams to support large-scale model training and post-training workflows. Contribute to architecture around distributed training, checkpointing, orchestration, fault tolerance, observability, data movement, evaluation infrastructure, and experiment velocity. Support frontier model workflows across pre-training, post-training, reinforcement learning, agent training, evaluation harnesses, and large-scale experiment execution.

Release Safety, Validation & Evaluation Gates
Architect validation and release systems that ensure model updates, inference engine changes, runtime images, prompt changes, context changes, and platform releases are correct, safe, performant, and regression-free. Define release gates across correctness, numerical stability, latency, throughput, token usage, cost regression, context quality, retrieval quality, safety behavior, reliability, and model output quality. Ensure platform optimizations do not reduce safety, grounding, quality, or user trust.

Reliability, Observability & Production Operations
Design systems that make AI infrastructure observable, debuggable, reliable, and operationally safe. Define telemetry, tracing, dashboards, alerts, logs, profiling views, runbooks, SLOs, and post-incident learning loops. Provide visibility into prompts, context payloads, retrieved sources, token consumption, model selection, cache behavior, inference latency, GPU utilization, evaluation scores, safety events, cost, and failures. Turn production issues into stronger platform abstractions, safer rollout mechanisms, better automation, and more reliable infrastructure.

Agentic & Multimodal Platform Internals
Support architecture for AI agents, tool use, memory, function calling, multimodal interaction, long-running workflows, and internal or external agent deployment. Work across agent harnesses, evaluation pipelines, workflow orchestration, safety controls, state management, tool execution, memory systems, and product-facing runtime constraints. Ensure agentic and multimodal systems are reliable, observable, secure, cost-aware, and safe under real workloads.

Technical Leadership
Work closely with Research, Inference, Runtime, Infrastructure, Product, Safety, Security, Technical Success, and Deployment teams. Act as a senior technical authority who can cut across layers, resolve ambiguity, identify systemic risks, and drive architecture decisions. Mentor engineers and technical leads on distributed systems, performance engineering, context engineering, cost optimization, production readiness, AI platform design, and architecture trade-offs. Represent architecture decisions through design docs, RFCs, diagrams, technical reviews, operational plans, and leadership-level summaries.

BASIC QUALIFICATIONS

Required Qualifications:

- 10-12 years of experience in software engineering, systems architecture, ML infrastructure, distributed systems, platform engineering, inference systems, cloud infrastructure, or large-scale backend engineering
- Strong hands-on engineering experience with Python and at least one systems/backend language such as C++, Go, Rust, Java, or TypeScript
- Deep understanding of distributed systems, production infrastructure, reliability engineering, scalability, observability, and fault-tolerant architecture
- Experience designing or operating large-scale systems involving APIs, microservices, distributed compute, orchestration, job scheduling, caching, high-availability infrastructure, and production monitoring
- Strong understanding of AI/ML systems, especially model serving, inference workflows, context engineering, retrieval systems, evaluation pipelines, and production model deployment
- Practical understanding of GPU systems, accelerator-based workloads, CUDA/Triton-style programming, distributed inference, GPU profiling, memory optimization, and communication libraries such as NCCL or RCCL
- Experience with ML frameworks and serving stacks such as PyTorch, JAX, TensorFlow, Triton, vLLM-style serving, Apache Ray, Kubernetes-based serving, or internal model-serving systems
- Ability to debug complex problems across model behavior, runtime systems, distributed infrastructure, networking, GPU execution, context quality, retrieval quality, evaluation harnesses, and production services
- Strong communication skills with the ability to write clear architecture documents, evaluate trade-offs, review implementation quality, and align teams around technically sound decisions

PREFERRED QUALIFICATIONS

- Experience working on LLM inference, multimodal inference, agent infrastructure, AI assistants, coding agents, or frontier-model serving platforms
- Experience with tensor parallelism, pipeline parallelism, model sharding, KV-cache optimization, batching, speculative decoding, streaming inference, and long-context serving
- Experience designing context engineering platforms, prompt/version management systems, model-routing frameworks, semantic caching layers, token-budgeting systems, or LLM cost dashboards
- Experience profiling GPU workloads using Nsight Systems, Nsight Compute, rocprof, perf, Prometheus, Grafana, OpenTelemetry, or custom profiling systems
- Experience with large-scale distributed training, RL infrastructure, checkpointing, ML compiler optimizations, model graph transformations, or training runtime systems
- Experience designing release gates, regression detection systems, canary systems, CI/CD validation frameworks, and production safety controls for performance-sensitive infrastructure
- Experience with evals, model quality measurement, hallucination detection, grounding evaluation, safety testing, and model behavior monitoring

TECHNICAL SKILL AREAS

AI Systems: LLM serving, inference runtime, training infrastructure, post-training workflows, agent systems, multimodal models

Inference: batching, routing, KV-cache, streaming, latency optimization, model serving, tensor parallelism, pipeline parallelism

Performance Engineering: CUDA, Triton, GPU profiling, kernel optimization, memory bandwidth, communication libraries, distributed execution

Context Engineering: prompt architecture, dynamic context assembly, RAG, memory, context compression, context ranking, source grounding, permission-aware retrieval

Cost Optimization: token budgeting, caching, model routing, fallback strategies, cost telemetry, batching, async workflows, cost-quality trade-offs

Distributed Systems: scheduling, orchestration, reliability, fault tolerance, observability, scalability, service design

ML Frameworks: PyTorch, JAX, TensorFlow, Triton, vLLM-style serving, Ray

Infrastructure: Kubernetes, Docker, Terraform, CI/CD, cloud platforms, Linux systems, networking, storage

Safety & Validation: evals, release gates, canaries, regression testing, model behavior validation, rollout safety

CANDIDATE PROFILE

The ideal candidate is a senior hands-on architect who can operate across the full AI systems stack. They should be able to discuss GPU memory bottlenecks, distributed inference, model-serving reliability, context quality, cost optimization, release validation, eval pipelines, observability, and production rollout with engineering teams, while also explaining architecture decisions clearly to senior leadership. The candidate should not be limited to architecture diagrams. They must be capable of reviewing implementation quality, identifying bottlenecks, debugging production issues, challenging weak assumptions, and converting repeated failures into stronger platform abstractions. This role requires the judgment of a senior architect, the debugging mindset of a systems engineer, and the ownership mindset required for production AI infrastructure.
