Monitoring
Monitoring is about automatically keeping an eye on our system to ensure it is healthy and performing as expected, providing a good experience for our customers.
What constitutes a "good experience" for the Control Plane differs significantly from what it means for the Data Plane. The Data Plane is primarily concerned with reliably providing high throughput and low latency. The Control Plane is more administrative, aiming to help customers understand and manage their resources and costs easily and efficiently.
These concerns can be grouped into operational and product monitoring.
Operational Monitoring
Standard operational metrics form an essential foundation for a healthy system. The classic operational metrics are availability and latency.
- Availability: The percentage of requests that were handled successfully. This doesn't make judgements about the nature of the request or response, merely that the system didn't fail unexpectedly (i.e. no 5XX errors).
- Latency: The time taken to process requests. All requests should be fast in the typical case, and not too slow in the worst case.
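As a rough illustration of the two definitions above, the sketch below computes availability and latency percentiles from a batch of request records. The `Request` record, its field names, and the choice of nearest-rank percentiles are assumptions made for this example, not a description of how the Control Plane actually collects its metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to process the request


def availability(requests: list[Request]) -> float:
    """Percentage of requests that did not fail with a 5XX status."""
    if not requests:
        raise ValueError("availability is undefined with no requests")
    handled = sum(1 for r in requests if r.status < 500)
    return 100.0 * handled / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of request durations, in milliseconds."""
    if not requests:
        raise ValueError("latency is undefined with no requests")
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(durations)))
    return durations[rank - 1]


sample = [
    Request(200, 10.0),
    Request(404, 20.0),   # a client error still counts as handled
    Request(500, 300.0),  # a server error counts against availability
    Request(200, 30.0),
]
print(availability(sample))            # share of non-5XX responses
print(latency_percentile(sample, 50))  # typical case
print(latency_percentile(sample, 99))  # worst case
```

A median (p50) captures the "typical case" and a high percentile such as p99 stands in for the "worst case"; which percentiles to track is a choice left open here.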
In practice, we expect these metrics to be stable, since the Control Plane is under very little load. We mainly need to know what they are, so that we have a sense of our customers' experience with the Control Plane, and to be alerted to anomalous changes.
Product Monitoring
More attention is instead paid to the user experience, following Google's HEART metrics framework (happiness, engagement, adoption, retention, task success). Since we are strategically interested in acquiring fewer, larger customers, there is more priority given to task success, happiness, and retention, especially early in a customer's life.
Task Success
Key tasks for the Control Plane include:
- Signup
- Account management (create, switch)
- Member management (invite, accept, remove)
- Index management (create, modify, delete)
- Early index usage (add docs, query)
Happiness
Completing tasks quickly and easily is one way to make customers happy, but they will inevitably have other, less explicit needs, such as confidence that their indexes are working, knowledge derived from documentation, comfort in what they are being billed for, and trust in the product and company. Happiness is a broad indicator of how we're doing across the board.
Typical inputs for happiness come from surveys, either sent via email or displayed on websites, such as the Console ("can you do what you need to?") and documentation ("was this helpful?"). Again, having fewer, larger customers means the sample size for surveys will be low, so we should lean more into qualitative feedback (requests for specific improvements, individual negative sentiments) rather than quantitative outputs.
Retention
The best measure of whether customers are getting value from Marqo Cloud is whether they keep using it, and ideally increase their spend over time. In general, customers visiting the Console and documentation more often may indicate higher engagement, but this is not a prerequisite for success; indeed, an ideal Marqo Cloud would probably be completely hands-off, with timely notifications when action is needed.
Some good measures for retention are:
- Churn rate: % of customers who leave in a period
- Customer Lifetime Value (LTV): monthly average revenue per account (ARPA) divided by churn rate
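To make the two formulas concrete, here is a minimal sketch of both calculations. The function names and the example figures are hypothetical, and the period is assumed to be one month so that churn rate and ARPA share the same time base.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of customers who left during the period."""
    if customers_at_start <= 0:
        raise ValueError("need at least one customer at the start of the period")
    return customers_lost / customers_at_start


def lifetime_value(monthly_arpa: float, monthly_churn_rate: float) -> float:
    """Customer Lifetime Value: monthly ARPA divided by monthly churn rate."""
    if monthly_churn_rate <= 0:
        raise ValueError("LTV is undefined when churn rate is zero")
    return monthly_arpa / monthly_churn_rate


# Made-up figures: 40 customers at the start of the month, 2 of whom left,
# and an average revenue per account of 1,000 per month.
monthly_churn = churn_rate(customers_at_start=40, customers_lost=2)
ltv = lifetime_value(monthly_arpa=1000.0, monthly_churn_rate=monthly_churn)
print(f"churn: {monthly_churn:.1%}, LTV: {ltv:,.0f}")
```

With these made-up numbers, 5% monthly churn implies an expected customer lifetime of 20 months, so the LTV is 20 times the monthly ARPA. With few customers, a single departure moves the churn rate a lot, so the result is best read as a rough indicator.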
Again, it's more important that we retain our larger customers, who better reflect our perceived target market. In practice we would probably want to weight each account's contribution to these metrics by its suitability (based on external factors about the businesses and users, not their usage of Marqo).
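One possible way to apply such a weighting to churn is sketched below. The suitability score (assumed here to be a number between 0 and 1 assigned to each account) and the function name are hypothetical; how suitability would actually be scored is not defined in this document.

```python
def weighted_churn_rate(accounts: list[tuple[float, bool]]) -> float:
    """Churn rate where each account counts in proportion to its suitability.

    `accounts` holds (suitability_weight, churned) pairs; the weight is a
    hypothetical score reflecting how well the account fits the target market.
    """
    total_weight = sum(weight for weight, _ in accounts)
    if total_weight <= 0:
        raise ValueError("total suitability weight must be positive")
    lost_weight = sum(weight for weight, churned in accounts if churned)
    return lost_weight / total_weight


accounts = [
    (1.0, False),  # well-suited account, retained
    (1.0, False),  # well-suited account, retained
    (0.2, True),   # poorly suited account, churned
    (0.2, True),   # poorly suited account, churned
]
print(f"weighted churn: {weighted_churn_rate(accounts):.1%}")
```

In this example half of the accounts churned, but because the churned accounts are a poor fit, the weighted rate is much lower than the unweighted 50%.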