I want to describe the anti-cheat system that we built without providing specifics that would allow people to bypass it.
The new anti-cheat initiative had the following goals:
- We wanted to have confidence that the system was working correctly – my biggest fear was that we’d accidentally punish innocent players. Hence, we wanted to collect and correlate different kinds of information covering both cheating and suspicion.
- Suspicion could be determined by many factors, such as movement within the game (but suspicion didn’t always imply cheating).
- Cheating was determined using some secret sauce (not discussed here).
- If a lot of confirmed cheaters were not registering as suspicious, that could indicate a problem – for example, that our suspicion signals were miscalibrated.
- We wanted to provide general support for anti-cheat signals that all games could rely on, but we also wanted a way for game servers to create their own anti-cheat signals.
- For example, the following might make a player more suspicious:
- They catch 100 IV Pokemon at higher rates than other players.
- They have an unusually high number of perfect throws.
- They frequently visit both Zaragoza, Spain and the island of Kiribati (a Pokemon Go-specific signal).
- None of the above proves that a player is cheating, but each increases their suspicion.
- Players shouldn’t be punished due to suspicion alone, but suspicion could trigger an automated investigation into a player, and the results of that investigation could lead to a punishment.
- The anti-cheat rules should be easy to manage (a configuration sketch follows this list).
- There are general anti-cheat rules that are owned by the anti-cheat team and inherited by all games.
- There are game-specific anti-cheat rules that can be managed by the game teams.
- Game teams can override the general rules that they inherit in certain cases.
- Transparency.
- We wanted dashboards that allow us to monitor exactly how each signal is working, the punishment rates, etc.
- We can easily correlate signals and punishments to different OS versions, phone models and brands, etc. for easy debugging.
- We have alerts that can tell us when our signal rates or punishment rates have exceeded a threshold.
- Limits and guardrails.
- Game teams can set thresholds on the total number of players punished each hour and limit the rate of punishment.
- If the number of detected cheaters exceeds a threshold, we don’t punish anybody until we verify that nothing is broken.
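To make the rule inheritance, overrides, and guardrails concrete, here is a minimal sketch of how such a layered configuration might be modeled. Every name, field, and threshold below is a hypothetical illustration, not the real configuration:

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    name: str
    investigate_threshold: int  # aggregated signal score that triggers an investigation
    punish_threshold: int       # aggregated signal score that warrants punishment

@dataclass
class AntiCheatConfig:
    # General rules are owned by the anti-cheat team and inherited by all games.
    general_rules: dict[str, Rule]
    # Game teams may override specific inherited rules.
    game_overrides: dict[str, Rule] = field(default_factory=dict)
    # Guardrail: pause punishments and verify nothing is broken beyond this rate.
    max_punishments_per_hour: int = 1000

    def effective_rule(self, name: str) -> Rule:
        """Game-specific overrides win over the inherited general rules."""
        return self.game_overrides.get(name, self.general_rules[name])
```

A punishment step would then consult `effective_rule` for each signal and stop issuing punishments once `max_punishments_per_hour` is exceeded, until a human verifies that nothing is broken.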
We decided to build two separate anti-cheat systems:
- One that collects signals, interprets rules, and issues punishments (cleverly referred to as the “new anti-cheat server”).
- One that focuses on performing investigations (which may or may not emit a signal that is later detected by the new anti-cheat server to issue a punishment) called the Argos server (Argos being the Greek giant with 100 eyes).
I named them both, so I’m not sure why one has a cool name and the other does not (although at one point somebody was calling the new server Hammurabi). I designed them both, wrote the documents, pushed them through the review process, etc., but after getting the initial communication working, I turned over development of the Argos server to another developer.

The new anti-cheat server roughly worked as follows:
- As the game runs, it sends RPCs to the game server, and this RPC information is stored in a Bigtable database.
- Every hour, the anti-cheat server runs a dataflow (sketched after this list) which does the following:
- It looks at the RPC data and generates signals from it.
- These signals are aggregated for each player.
- It reads rules from the configuration that indicate when punishments or investigations are warranted.
- If a punishment or an investigation is warranted, the information is stored in a separate table.
- Every step above generates metrics that are stored in a dedicated time-series database.
- The game server periodically reads the punishment table and updates its own player database to reflect these punishments.
- New dashboards read the time-series database so we can view trends or anomalies.
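To make the pipeline shape concrete, here is a rough sketch using Apache Beam’s Python SDK (Beam being my assumption for what the dataflow could look like). The record fields, the `fast_throw` signal, and the scoring scheme are all hypothetical – the real signals and rules are deliberately omitted:

```python
import apache_beam as beam

# Hypothetical rule weights and thresholds.
RULES = {"fast_throw": 5, "investigate_threshold": 5, "punish_threshold": 10}

def to_signals(rpc_record):
    """Turn one raw RPC record into zero or more (player_id, signal) pairs."""
    if rpc_record.get("throw_ms", 1000) < 50:  # an improbably fast throw
        yield (rpc_record["player_id"], "fast_throw")

def apply_rules(player_and_signals, rules):
    """Decide whether a player's aggregated signals warrant any action."""
    player_id, signals = player_and_signals
    score = sum(rules.get(s, 0) for s in signals)
    if score >= rules["punish_threshold"]:
        yield ("punish", player_id)
    elif score >= rules["investigate_threshold"]:
        yield ("investigate", player_id)

sample_records = [  # in production: the last hour of RPC data read from Bigtable
    {"player_id": "p1", "throw_ms": 30},
    {"player_id": "p2", "throw_ms": 400},
]

with beam.Pipeline() as p:
    (p
     | "ReadRpcRecords" >> beam.Create(sample_records)
     | "GenerateSignals" >> beam.FlatMap(to_signals)
     | "AggregatePerPlayer" >> beam.GroupByKey()
     | "ApplyRules" >> beam.FlatMap(apply_rules, RULES)
     | "WriteActions" >> beam.Map(print))  # in production: the punishment table
```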
The Argos system is very complex and sufficiently sensitive that I won’t talk about it too much, but it was written in C++ (because it had to run on the game server cluster and contained enough sensitive IP that we needed to obfuscate it from third parties). The game server communicated with the Argos server via gRPC.
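For flavor only, the game server’s side of that gRPC conversation might look like the sketch below. The service, method, and message names are stand-ins I’ve invented for illustration (the stubs would be generated by protoc); the real Argos API is intentionally not described:

```python
import grpc

# Hypothetical modules generated by protoc from a proto along the lines of:
#   service Argos { rpc StartInvestigation (InvestigationRequest) returns (Ack); }
import argos_pb2
import argos_pb2_grpc

def request_investigation(player_id: str) -> None:
    # Open a channel to the (hypothetical) Argos endpoint and fire the RPC.
    with grpc.insecure_channel("argos.internal:50051") as channel:
        stub = argos_pb2_grpc.ArgosStub(channel)
        stub.StartInvestigation(argos_pb2.InvestigationRequest(player_id=player_id))
```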
I started working on the server in earnest right around the time that the pandemic hit. A few months later, Niantic gave me the ultimate gift – they hired two new people to help me on the project! They both started on the same day.
- Piaw Na is a very seasoned engineer and former top performer at Google who seems to know everybody (except for Raymond Chen – Piaw wanted me to introduce him, but so far he’s never come up to Seattle). He even once had me on a call with the founder of Databricks (who used to be Piaw’s intern).
- Savitha Jayasankar was a junior engineer, but she is super smart and hard working (and she quickly became a senior engineer).

The three of us spoke for several hours every day, and together we formed the anti-cheat triumvirate. Due to the isolation of the pandemic, these calls often felt like therapy (at least for me).
Piaw especially helped with the dataflow performance, and Savitha did most of the real work on everything else (or at least it felt that way). Her first assignment was to collect all of the dataflow metrics, and I mistakenly believed that would be easy using Prometheus. I didn’t take into account that we wanted each metric recorded at the time the event actually occurred (not when the dataflow later detected it), and Prometheus’s scrape-based model isn’t designed to ingest samples with past timestamps. Hence, Prometheus didn’t work, and Piaw and Savitha brainstormed and tried various approaches until determining that the combination of Postgres and Grafana worked well where other combinations did not. They even received a patent for their work.
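The crux is that Prometheus samples whatever value is current at scrape time, whereas a plain Postgres table lets the dataflow write each metric with the timestamp at which the event actually occurred, and Grafana can chart such a table directly with SQL. Here is a minimal sketch of that approach, with a hypothetical schema and metric name:

```python
import psycopg2

# Connection string and schema are illustrative only.
conn = psycopg2.connect("dbname=anticheat")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            event_time  timestamptz      NOT NULL,  -- when the event actually occurred
            name        text             NOT NULL,  -- e.g. 'signals_generated'
            value       double precision NOT NULL
        )
    """)
    # The dataflow backfills with the event's own timestamp, not insertion time.
    cur.execute(
        "INSERT INTO metrics (event_time, name, value) VALUES (%s, %s, %s)",
        ("2021-03-01T12:00:00Z", "signals_generated", 42.0),
    )
```

Grafana’s Postgres data source can then plot a query like `SELECT event_time, value FROM metrics WHERE name = 'signals_generated'` as an ordinary time series.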
On average, the dataflow processed about 700,000 records per hour, but during special events it could easily spike to over a billion records an hour. The dataflow usually ran in about half an hour and it cost roughly $200,000 less to run each month than the existing system.
After the system was running, it was clear that it needed new tooling to allow our operations team to manually punish players, manually un-punish players, and perform their own investigations. As no front-end engineers were available, I had to learn Angular, which I used to create a very crude front end.