My Role
Product Design Lead
Skills
Product Design
User Testing & Research
Hackathon & Workshops Facilitation
Impact
~$10 million
of monthly cost reductions in server repairs
75.82%
Decrease in Closed Tickets Without Repeats
6%
Reduction in Employee Churn
At Meta, we’re constantly pushing the boundaries of Generative AI and VR, but scaling these innovations requires a rock-solid server infrastructure. Reliability and capacity aren’t just goals—they’re necessities.
That’s where we faced a challenge. Our servers were struggling to keep up, leading to frequent crashes and disruptions. Engineers were caught in an endless loop of fixes, slowing progress and creating frustration across teams.
To break the cycle and build a future-ready system, we took a deep dive into the problem, uncovering the root causes and designing a solution to keep our infrastructure running smoothly at scale.
83
Internal Facebook Group Posts Read
8
Engineers Interviewed
16
On-Site Technicians Interviewed
Engineers are constantly reinventing the wheel when troubleshooting server issues due to a lack of easily accessible and searchable documentation of past problems and solutions.
The lack of a structured system for capturing and sharing knowledge creates inefficiencies and discourages contributions, resulting in repeated work and lost expertise.
Wasted Time
Server Downtime
Engineer Burnout
Higher Costs
Over three months of discovery, we explored solutions for the core challenge and ways to seamlessly connect multiple systems—ticketing, FSA, and the Repair Console. Throughout this process, we engaged directly with our users, who were also key stakeholders, to gain deeper insights and refine our approach.
Challenge 1 | Action Plan Repository
Consistency, standards, and self-service
How might we help repair experts enforce best practices and create workflows that can quickly evolve and scale with our operations teams?
Metrics
Deprecate wiki runbooks
Decrease number of undiagnosed tickets
Challenge 2 | Ticketing Platform
The Great Jira Escape: We're Plotting Our Freedom
How can we eliminate Jira while ensuring tickets are efficiently assigned to the right person?
Metrics
Decrease mean time to resolve
Decrease number of undiagnosed tickets
Increase repair accuracy
Challenge 3 | Global Monitoring
Creating a global perspective
How might we bring all the data about every problem together in one place so that we can recognize patterns earlier and fix problems faster?
Metrics
Decrease mean time to resolve
Decrease number of undiagnosed tickets
Adapt, adapt, adapt
Constant change in the business taught me the value of quick thinking and flexibility: we had to stay agile so that our vision and design could adapt to shifting priorities.
Do more than meet in the middle
I made sure we understood each stakeholder's and team member's expectations, meeting them where they were so we could achieve our shared goals together.
Remember to pause
Even with tight deadlines and frequent changes in direction, I often had to remind myself that taking a step back helps me perform better.