Functional Lead, BSS Operations [DevOps & Platform Maintenance] - BSS Ops Department (BSOPD)

Rakuten

Software Engineering, Operations
Tokyo, Japan
Posted on Nov 27, 2025

Job Description:

Business Overview

The Technology Platforms Division (TPD) drives the growth of Rakuten's ecosystem by delivering innovative, high-quality technology platforms characterized by integrated control and strategic partnerships. It is responsible for building and operating the infrastructure and ecosystem platforms that power the Rakuten Group.

Department Overview

Our department, the BSS Ops Department (BSOPD), provides operational services for BSS applications, both B2B and B2C, and is also responsible for maintaining the IT infrastructure (on-premise and cloud environments) for the BSS platform.

Position:

Why We Hire

We are looking for entrepreneurial, innovative, growth-oriented, and customer-obsessed individuals to join our growing team to build the Telco of the Future.
We are a truly global organization, with team members from Japan, India, North America, South America, Europe, China, Korea, Australia, Africa, and more, shifting to a fast-paced, agile way of working.

Position Details

- Ensure high availability, resilience, and scalability across multi-region production environments through automation and proactive monitoring.

- Design and maintain CI/CD pipelines (Jenkins, GitLab CI, ArgoCD) to enable continuous delivery for microservice and portal components.

- Build and operate observability frameworks (metrics, logs, and traces) using Dynatrace, Grafana, Prometheus, Splunk, and Kibana.

- Develop and enhance infrastructure-as-code templates (Terraform, Ansible) to manage cloud and on-premise resources consistently.

- Participate in the on-call rotation for critical incidents, lead service restoration, and perform detailed Root Cause Analyses (RCA).

- Collaborate with development, product, and network teams to optimize system performance and stability across Rakuten’s digital ecosystem.

- Implement and track SLOs, SLIs, and SLAs for all critical services to improve reliability and align with business objectives.

- Contribute to post-incident reviews, drive automation for recurring issues, and continuously enhance system resilience.

- Create and maintain runbooks, dashboards, and knowledge base documentation for operational readiness and training.

- Support regular maintenance, feature rollouts, and security patching for production and pre-production environments.

Mandatory Qualifications:

1) Technical Expertise

- Cloud Platforms: Extensive hands-on experience with AWS and/or Rakuten Cloud Platform (RCP) services (e.g., EC2, EKS, S3, IAM, VPC, Route 53).

- Containerization & Orchestration: Strong expertise with Docker, Kubernetes (K8s), and Helm for deploying, scaling, and managing distributed, microservice-based applications. Experience with Helm charts, ConfigMaps, and Secrets management.

- Infrastructure as Code (IaC): Proficiency with Terraform, CloudFormation, or Ansible for automated infrastructure provisioning, configuration management, and drift detection.

- CI/CD Automation: Deep knowledge and hands-on experience designing and implementing automated build and deployment pipelines using Jenkins, GitLab CI/CD, and ArgoCD. Familiarity with Git branching strategies, artifact management (Nexus, Artifactory), and code quality gates (SonarQube). Experience with blue-green and canary deployment strategies.

- Monitoring & Observability: Expert-level experience with Dynatrace, Grafana, Prometheus, ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, and/or New Relic for full-stack visibility, metrics collection, alerting, and dashboard creation.

- Logging & Tracing: Skilled in centralized logging and distributed tracing tools such as Dynatrace, New Relic, AppDynamics, Jaeger, or OpenTelemetry. Strong understanding of end-to-end observability for diagnosing complex issues.

- Scripting & Automation: Strong proficiency in Python, Shell (Bash), or Go for developing automation scripts, health checks, self-healing mechanisms, and reliability tools.

- Operating Systems: Expert in Linux/Unix administration, including performance tuning, troubleshooting, and security hardening.

- Networking & Security: Solid understanding of TCP/IP, DNS, load balancing, TLS/SSL, firewalls, and identity management (e.g., OAuth2, SSO).

- Incident Management: Proven experience in handling P1/P2 incidents, leading service restoration, performing detailed Root Cause Analyses (RCA), and implementing preventive measures.

- Version Control & Collaboration: Proficient in Git, Bitbucket, and agile collaboration tools like JIRA and Confluence.

2) Domain & Methodological Knowledge

- Telecom BSS/OSS Systems: Strong understanding of Rakuten’s customer-facing portals, CRM, order workflows, and the broader telecommunications BSS/OSS landscape.

- Site Reliability Engineering (SRE): Ability to define and monitor SLOs, SLIs, and SLAs to ensure service reliability and uptime targets. Familiarity with SRE best practices (e.g., Google SRE model) and error budget management.

- Hybrid/Multi-Cloud: Experience managing Kubernetes clusters and deploying applications in hybrid cloud or multi-cloud environments (AWS EKS, Rakuten Cloud Platform).

- Cost Optimization & Capacity Planning: Experience with cost optimization strategies and capacity planning in cloud environments.

- IT Governance: Familiarity with ITIL and ISO 27001 standards.

3) Professional Competencies

- Problem-Solving: Exceptional analytical and troubleshooting capabilities to resolve complex, time-sensitive issues efficiently.

- Communication: Excellent verbal and written communication skills to articulate technical issues to both technical teams and non-technical stakeholders (e.g., business users, L1 support).

- Adaptability: The ability to quickly learn and adapt to new front-end technologies, frameworks, and evolving business processes within a dynamic environment.

- Customer Focus: A strong commitment to ensuring a positive and efficient user experience for both customers and internal agents.

4) Experience & Education

- Bachelor's degree in Computer Science, Information Technology, or a related technical field.

- Typically 8 to 12 years of experience in an L3 or equivalent technical support role, ideally within the telecommunications sector.

- Proven experience with ITSM methodologies and ticketing tools such as ServiceNow or Jira.

Desired Qualifications:

- Proactive approach to problem-solving.

- Strong organizational skills and experience with budget management.

- Knowledge of industry standards and compliance requirements.

- Ability to work independently and as part of a team.

- Commitment to continuous learning and professional development.

Other Information:

Additional Information on Location

Rakuten Crimson House (Head office)

#engineer #developmentsupport #technologyplatformdiv

Languages:

English (Overall - 4 - Fluent)