Handbook

Principles

As a technology team we have four principles that we align ourselves to. These came out of our original Maturity Model of 2017, which included Maintainability, Testing, Delivering Fast and Security.

Monitor All the Things

At mx51, we highly value the observability of our platform at all layers. This primarily consists of logs and metrics.

Logs are generated from various components and layers, including infrastructure, application and CI/CD tools, and are aggregated in our logging tool. mx51’s software engineers are encouraged to log as much as is feasible (within compliance constraints).
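
As a minimal sketch only (the logging library and field names below are illustrative assumptions, not a prescribed standard), an application component might emit structured JSON log lines so the central aggregation tool can index the fields:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON output so the central log aggregation tool can index each field.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Log enough context to trace the event, without PII beyond what
	// compliance allows (field names here are purely illustrative).
	logger.Info("payment settled",
		"component", "settlement-worker",
		"tenant", "demo",
		"request_id", "7f3c9a",
		"duration_ms", 42,
	)
}
```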

The infrastructure running in the cloud also generates a wide range of infrastructure metrics that are monitored and alarmed on.

The application layer similarly generates a stream of Application Performance Metrics that are also monitored and alarmed on. Application developers are encouraged to create custom metrics for their components that allow for monitoring of platform functionality and alerting when incorrect or abnormal behaviour is observed.
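
As an illustrative sketch (the actual metrics stack is not specified here, and the metric and label names are made up), a developer might register a custom counter for a component and increment it per outcome so that an alert can fire on abnormal failure rates, for example with the Prometheus Go client:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical custom metric: counts settlement attempts by outcome so an
// alert can fire when the "failed" series rises abnormally.
var settlementAttempts = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "settlement_attempts_total",
		Help: "Settlement attempts partitioned by outcome.",
	},
	[]string{"outcome"},
)

func settle() {
	// ... business logic ...
	settlementAttempts.WithLabelValues("ok").Inc()
}

func main() {
	settle()
	// Expose the metrics endpoint for scraping by the monitoring stack.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```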

Observability tools are one of the few components that run centralised across tenants.

It is the responsibility of every developer to generate sufficient logs and metrics from their code and wire them into the central aggregation tools. The aggregation, monitoring and alerting tools are owned and managed by the infrastructure team.

Security First, Shift Left

We constantly work to shift security left in our work, rather than treating it as a traditional afterthought.

Security is incorporated into the technical design of any solution at mx51. The security team helps the product engineering team to design, build and maintain a secure product platform. Staff receive secure code training, systems are pen tested and systems are continually monitored for vulnerabilities and exploits.

The most important facets of product security are encryption, auth/access and threat management.

In the product cloud, encryption of any data in transit and on disk is the default design option. Furthermore, any data that is identified as Personally Identifiable Information (PII) is also encrypted at the application layer, such that if a database dump or a file export is leaked, it would not be in the clear.
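
A minimal sketch of what application-layer encryption of a PII field could look like before it is persisted (this assumes an AES-256-GCM key supplied by a key management service; the function and field names are illustrative, not the actual implementation):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"io"
)

// encryptPII seals a PII value with AES-256-GCM so that a leaked database
// dump only contains ciphertext. In practice the key would come from a KMS,
// not be hard-coded or generated locally.
func encryptPII(key []byte, plaintext string) (string, error) {
	block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return "", err
	}
	// Prepend the nonce so it can be recovered for decryption.
	sealed := gcm.Seal(nonce, nonce, []byte(plaintext), nil)
	return base64.StdEncoding.EncodeToString(sealed), nil
}

func main() {
	key := make([]byte, 32) // placeholder key; use a KMS-managed key in reality
	ciphertext, _ := encryptPII(key, "cardholder@example.com")
	fmt.Println(ciphertext)
}
```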

As for authentication and authorisation, the least privilege access principle is applied throughout.

For anything at the infrastructure layer, the product engineers use multi-factor authentication to gain access.

For the application layer, a user management sub-system has been specially built for and incorporated into the product. It is maintained by the backend team in collaboration with the security team. Administrator users can provision more users. Users are provisioned with various levels of access using the least privilege access principle. Similarly, users can self-provision API keys with various levels of access so that certain functionality can be automated against APIs instead of accessed manually from the portals.
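
To illustrate the least-privilege idea for self-provisioned API keys, here is a hypothetical sketch of a per-route scope check (the actual user management sub-system’s key format, scope names and routes are not shown here):

```go
package main

import (
	"net/http"
)

// lookupScopes is a stand-in for the user management sub-system: it would
// resolve an API key to the scopes it was provisioned with.
func lookupScopes(apiKey string) map[string]bool {
	// Hypothetical data; real keys and scopes live in the user management
	// sub-system, not in code.
	return map[string]bool{"transactions:read": true}
}

// requireScope rejects requests whose API key was not granted the scope
// needed by the handler, enforcing least privilege per key.
func requireScope(scope string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("Authorization")
		if !lookupScopes(key)[scope] {
			http.Error(w, "insufficient scope", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	listTransactions := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("[]"))
	})
	http.Handle("/v1/transactions", requireScope("transactions:read", listTransactions))
	http.ListenAndServe(":8080", nil)
}
```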

Lastly, threat modelling occurs early in the product development lifecycle to ensure the right defences are put in place. All assets in the product cloud are continuously monitored for threats, including vulnerabilities that could lead to compromise and activity indicative of compromise.

Release Fast, Fix Forward

The releases and deployments of upgraded or new technical components are properly managed through versioning, automation, checklists, monitoring and auditing.

Every component is properly versioned in source control (git) as well as in artefact repositories (mainly ECR for container images). Versions are immutable.

For cloud-side components, whether they’re at the infrastructure level using Infrastructure as Code or at the application level using Docker containers, deployment is scripted and automated using CI/CD. We’re constantly working to reduce human touch points.

Changes are first tested in internal, “lower” environments, before being deployed to the tenants’ environments.

Changes are meant to:

  • Cause no outage – Any users of the system should not perceive any downtime
  • Cause no in-flight errors – Any requests in flight during the deployment should not be interrupted
  • Be backwards compatible – When upgrading a component, any clients of that component should continue working without requiring an immediate upgrade, aka non-breaking changes (a short code sketch follows this list). Some examples are:
    • Upgrading a database schema, the older version of the application should still be able to operate normally with the new schema
    • Upgrading an internal microservice, any client microservices should not break. Designing gRPC specs thoughtfully makes this possible.
    • Upgrading a web API, such that clients that have no idea about the upgrade are not impacted – non-breaking changes
    • Upgrading the in-store protocol on the payment terminal, such that POS still using older versions of the client libraries are not impacted
  • Be able to roll back – Rolling back a component should rarely be needed. It is always preferred to roll forward instead, i.e. creating a fix for the problem at hand and deploying that. In the rare case that a problem impacting availability, integrity or compliance is observed soon after a deployment and cannot easily be fixed by rolling forward, a rollback is executed by going back to the previous version of the component.
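
As a hedged sketch of the backwards-compatibility point above (the request type and field names are made up, not from an actual mx51 API), a web API can add a new optional field with a sensible default so that older clients that never send it keep working:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RefundRequest gained an optional Reason field in a later release. Older
// clients that omit it still deserialise cleanly and get the default below,
// so the change is non-breaking.
type RefundRequest struct {
	TransactionID string `json:"transaction_id"`
	AmountCents   int64  `json:"amount_cents"`
	Reason        string `json:"reason,omitempty"` // new, optional
}

func parseRefund(body []byte) (RefundRequest, error) {
	req := RefundRequest{Reason: "unspecified"} // default for old clients
	err := json.Unmarshal(body, &req)
	return req, err
}

func main() {
	// Payload from an older client that predates the "reason" field.
	old := []byte(`{"transaction_id":"tx_123","amount_cents":500}`)
	req, _ := parseRefund(old)
	fmt.Println(req.Reason) // "unspecified"
}
```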

On the rare occasion where the above cannot be achieved, a deployment is handled in a careful, ad-hoc manner, consulting internal and external parties as needed.

Releases and deployments are highly visible and auditable through the version control and release management systems that are used, and are integrated with observability monitors/outputs such as Slack.
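
As an illustrative sketch only (the webhook URL, component name and message format are placeholders, not our actual integration), a deployment step could post a notification to a Slack incoming webhook so that releases stay visible:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// notifyDeploy posts a simple message to a Slack incoming webhook. The URL
// would come from CI/CD secrets; the one below is a placeholder.
func notifyDeploy(webhookURL, component, version string) error {
	payload, err := json.Marshal(map[string]string{
		"text": "Deployed " + component + " " + version,
	})
	if err != nil {
		return err
	}
	_, err = http.Post(webhookURL, "application/json", bytes.NewReader(payload))
	return err
}

func main() {
	_ = notifyDeploy("https://hooks.slack.com/services/PLACEHOLDER", "settlement-worker", "v1.4.2")
}
```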

For some changes such as application component deployments, as an extra precaution, a lightweight manual checklist is created and peer reviewed before deploying to an environment.

Change management is meant to help us move faster, not slower, by providing plenty of safety nets. The intention is that we are able to deploy updated components daily. If we didn’t have a close call every now and then, we wouldn’t be moving as fast as we could be.

Proper change management helps with Availability, Integrity, Compliance and Cost Effectiveness.

Value Quality Over Quantity

We value easy-to-understand, well-written and maintainable code over clever code. We re-use, don’t repeat ourselves, keep things simple and enjoy deleting code – the world has enough. Code comments shouldn’t be needed, components are loosely coupled, functions have clear purpose, artefacts are immutable, we avoid building things and all code is peer reviewed. We encourage collaboration between engineers to reach the right solution for the job. We leave our egos at the door.

Each engineering discipline is in charge of testing their own work using best practice for that discipline. Such tests are typically automated tests wired into CI/CD pipelines. Typical tests include the following (a small example follows the list):

  • Unit testing
  • UI testing
  • Database integration and migration testing
  • End-to-end testing
  • Performance testing
  • Security/penetration testing
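
For instance, a unit test wired into the CI/CD pipeline might look like the following minimal sketch (the function under test and its package are hypothetical):

```go
package payments

import "testing"

// addSurcharge is a hypothetical function under test.
func addSurcharge(amountCents int64, surchargeBps int64) int64 {
	return amountCents + amountCents*surchargeBps/10000
}

// TestAddSurcharge runs automatically in the CI/CD pipeline via `go test`.
func TestAddSurcharge(t *testing.T) {
	got := addSurcharge(10000, 150) // 1.5% surcharge on $100.00
	want := int64(10150)
	if got != want {
		t.Errorf("addSurcharge() = %d, want %d", got, want)
	}
}
```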

Additionally, there is a team that is dedicated to “traditional” end-user QA. At a high level this team does two lines of testing:

  • New Product / Feature testing – as mx51 builds out the product suite further, new features are typically QA’d independently of specific tenants, for example in the eng-qa environment and in the demo tenancy.
  • Tenant Integration Testing – when a new tenant is being set up with an mx51 tenancy, or when an existing tenant is turning on a new feature, the QA team will make sure that the platform works as expected when configured and/or integrated for that tenant.

We’re not perfect, but every day we try to lift the bar.