Site Reliability Operations Engineer

Walmart


Location:
Sunnyvale, CA 94086
Date:
02/01/2018
2018-02-01 to 2018-03-04

Job Details

Req ID:
917967BR

Company Summary:
Walmart Global eCommerce comprises Walmart.com, VUDU, SamsClub.com, and our technical powerhouse @WalmartLabs. Here, innovators incubate next-gen e-commerce solutions in real time. We integrate online, physical, and mobile shopping experiences for billions of customers around the globe. How do we do it? We continuously build and invest in new technology, including open source tools and big data innovations. Data scientists, front-end and back-end engineers, product managers, and web and UX/UI teams collaborate alongside e-commerce experts to envision, prototype, and bring revolutionary ideas to life in a dynamic, flexible and fun work culture.

Job Title:
Site Reliability Operations Engineer

City:
SUNNYVALE

State:
CA

Position Summary:
The SRC Site Reliability Operations Engineer is responsible for proactively monitoring, detecting and resolving site issues before they affect customers or availability. Technically, you will understand the full end-to-end stack and use this knowledge to detect errors and failures and take corrective action to mitigate them.
During a major incident, you will draw on your technical skills and knowledge to triage, differentiating between symptom and cause, to help restore service. Your ability to continuously challenge yourself and develop a strong network within your peer group will see you excel in this role. Our goal is to protect the customer experience and deliver outstanding levels of availability.

Minimum Qualifications:

- 3+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems.

- Bachelor's Degree in Computer Science or a related field, or relevant work experience.

- Strong and demonstrable incident management skills with relevant experience in an enterprise organization.

- Experience and exposure working in a 24/7 operations support environment.

- Methodical and systematic problem solving approach, combined with a solid awareness of ownership, initiative and drive.

- Experience investigating, analyzing and troubleshooting large scale enterprise systems.

- Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.

- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell.

- Experience administering Unix/Linux in a production environment.

- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.

- Experience working with and developing enterprise monitoring/tooling solutions like Grafana, Kibana, Splunk, Graphite, Nagios, New Relic, Graylog and HPOM.

- Working knowledge of one or more cloud technologies such as AWS, Azure, or OpenStack.

Additional Preferred Qualifications:

- Actively provide data for and participate in root cause analysis.

- Adhere to SRC onboarding process when accepting new systems into service.

- Share knowledge globally between SRC teams.

- Analyze systems and make recommendations to prevent possible incidents.

- Strive for continuous improvement and make recommendations based on SRC process.

- Other duties and responsibilities as assigned.

Category:
Software Development and Engineering

Division:
Global eCommerce

Employment Type:
Full Time

Requisition Template:
eCommerce