
Site Reliability Operations Engineer

Walmart


Location:
Sunnyvale, CA 94086
Date:
02/10/2018
2018-02-10 / 2018-03-23

Job Details

Req ID: 986318BR

Company Summary: Walmart Global eCommerce is comprised of Walmart.com, VUDU, SamsClub.com, and our technical powerhouse @WalmartLabs. Here, innovators incubate next-gen e-commerce solutions in real time. We integrate online, physical, and mobile shopping experiences for billions of customers around the globe. How do we do it? We continuously build and invest in new technology, including open source tools and big data innovations. Data scientists, front- and back-end engineers, product managers, and web and UX/UI teams collaborate alongside e-commerce experts to envision, prototype, and bring revolutionary ideas to life in a dynamic, flexible and fun work culture.

Job Title: Site Reliability Operations Engineer

Position Summary: As a Site Reliability Operations Engineer within the Global Technical Engineering Operations (GTEO) SRC team, you will work with other SRC, TDO, SRE, DevOps and Engineering practitioners to proactively maintain mission-critical infrastructure, cloud platforms, micro-services, tools, and processes that will ensure the highest levels of availability and reliability of all our websites.



You're right for the job if you are comfortable contributing to major incident response in a technical team of engineers laser-focused on restoring service across complex distributed architectures. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work directly with our SRE, Engineering and DevOps teams to support our next-generation "always up" cloud-based e-commerce platform.

City: Sunnyvale

State: CA

Minimum Qualifications:

- Understanding of incident management processes and procedures.

- Calm under pressure when participating in major incident response.

- Technical understanding of core infrastructure, cloud services, platforms and micro-services.

- Ability to understand and capture key data from logs.

- Ability to understand traffic flows and key dependencies between services.

- Ability to triage effectively and distinguish symptom from cause.

- Detect and quantify impact.

- Analyze trends to proactively prevent incidents.

- Focus on immediate restoration vs root cause.

- Research and recommend alternative actions for incident resolution, and develop procedures and documentation to support them.

- Create and maintain procedural documentation.

- Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).

- Absorb knowledge of complex distributed systems and share it with your peer group and beyond.

- Build tools to improve visibility, proactively detect issues and restore system availability.

- Develop automation and self-healing with DevOps, Engineering and SRE partners.

- Strong focus on collecting and inferring metrics.

- Clear communication skills.

- Ability to contribute to multiple incidents at any given time.

- Analyze systems and make recommendations to prevent possible problems. Take the lead on issue resolution activities using knowledge of complex, company-wide systems.

- Scripting and software development to automate and help enhance existing solutions.



Additional responsibilities may include:



- Actively provide data for and participate in root cause analysis.

- Adhere to SRC onboarding process when accepting new systems into service.

- Share knowledge globally between SRC teams.

- Analyze systems and make recommendations to prevent possible incidents.

- Strive for continuous improvement and make recommendations based on SRC process.

- Other duties and responsibilities as assigned.

Category: Information Technology

Division: Walmart Labs

Division Summary: @WalmartLabs is the technical powerhouse behind Walmart Global eCommerce. We employ big data at scale, from machine learning, data mining and optimization algorithms to modeling and analyzing massive flows of data from online, social, mobile and offline commerce. We don't just engineer cool websites, mobile apps, and new services; we use our own open source tools to create the framework. Deployment is automated and accelerated through our open cloud platform. This makes us incredibly nimble and able to adjust in real time to our global customers.

Employment Type: Full Time

Requisition Template: eCommerce