End to end and smoke tests give a really valuable angle on what the app is doing and can warn you about failures before they happen. However, because they’re working with a live app and a live database over a live network, they can introduce a lot of flakiness. Beyond just changes to the app, different data in the environment or other issues can cause a smoke test failure.
How do you handle the inherent flakiness of testing against a live app?
When do you run smokes? On every phoenix branch? Pre-prod? Prod only?
Who fixes the issues that the smokes find?
I think the reality is that there are lots of different levels of tests, we just don’t have names for all of them.
Even unit tests have levels. You have unit tests for a single function or method in isolation, then you have unit tests for a whole class that might set up quite a bit more mocks and test the class’s contract with the rest of the system.
Then there are tests for a whole module, that might test multiple classes working together, while mocking out the rest of the system.
A step up from that might be unit tests that use fakes instead of mocks. You might have a fake in-memory database, for example. That enables you to test a class or module at a higher level and ensure it can solve more complex problems and leave the database in the state you expect it in the end.
A step up from that might be integration tests between modules, but all things you control.
Up from that might be integration tests or end-to-end tests that include third-party components like databases, libraries, etc. or tests that bring up a real GUI on the desktop - but where you still try to eliminate variables that are out of your control like sending requests to the external network, testing top-level window focus, etc.
Then at the opposite extreme you have end-to-end tests that really do interact with components you don’t have 100% control over. That might mean calling a third-party API, so the test fails if the third-party has downtime. It might mean opening a GUI on the desktop and automating it with the mouse, which might fail if the desktop OS pops up a dialog over top of your app. Those last types of tests can still be very important and useful, but they’re never going to be 100% reliable.
I think the solution is to have a smaller number of those tests with external dependencies, don’t block the build on them, and look at statistics. Sound an alarm when a test fails multiple times in a row, but not for every failure.
Most of the other types of tests can be written in a way to drive flakiness down to almost zero. It’s not easy, but it can be doable. It requires a heavy investment in test infrastructure.