In which we balance fixing things and going faster.
What Have I Broken? #
The test suites running against a minimal game database were relatively stable, but when I started up the mono-based game server against the production database, it would eventually stop running. Sometimes logging in a player would work, sometimes they’d be greeted with a blinking cursor and nothing more. There wasn’t even a useful exception thrown that would show up in the output.
...
06/14/2023 21:33:51: NPCs: [1741] | Players: [0] | CPU: [??%] | Rnd: [83]
06/14/2023 21:39:19: NPCs: [1741] | Players: [0] | CPU: [??%] | Rnd: [89]
...hours pass with nothing...
A Sea of Timers #
Given all of the changes in environment, I was concerned about the timing. It had already become slow to start up after I started pushing so much data out to the database. The functioning of the game server depends on a large set of repeated timers:
- RoundEvent, 5 seconds
- InactivityEvent 10 seconds
- JanitorEvent, 10 seconds
- UpdateServerStatus. 30 seconds
- SaveEvent, 30 seconds
- World.ShiftDailyCycle 1800 seconds
- World.ShiftLunarCycle 7200 seconds
- For each Character:
- RoundEvent, 5 seconds
- ThirdRoundEvent, 15 seconds
- For each NPC:
- NPCEvent, 5 seconds
- For each Effect:
- AreaEffectEvent, 5 seconds
- CharacterEffectEvent, 5 seconds
If everything goes according to plan, all of these go off after some multiple of the time for one
round (currently 5 seconds). Let’s make sure we understand how the RoundEvent
is going, and track
its incrementing of the round counter.
|
|
|
|
06/14/2023 22:44:44: {SystemGo} Starting main game loop.
> 06/14/2023 22:44:49: {SystemGo} RoundEvent starts with round = 0. SignalTime was 06/14/2023 22:44:49
> 06/14/2023 22:48:35: {SystemGo} RoundEvent ends with round = 1. SignalTime was 06/14/2023 22:44:49,
this RoundEvent took 00:03:45.9208470.
> 06/14/2023 22:48:35: {SystemGo} RoundEvent starts with round = 1. SignalTime was 06/14/2023 22:48:35
> 06/14/2023 22:48:35: {SystemGo} RoundEvent ends with round = 2. SignalTime was 06/14/2023 22:48:35,
this RoundEvent took 00:00:00.0013760.
...
> 06/14/2023 23:02:09: {SystemGo} RoundEvent starts with round = 208. SignalTime was 06/14/2023 23:02:09
> 06/14/2023 23:02:09: {SystemGo} RoundEvent ends with round = 209. SignalTime was 06/14/2023 23:02:09,
this RoundEvent took 00:00:00.0039720.
> 06/14/2023 23:02:14: NPCs: [1722] | Players: [0] | CPU: [??%] | Rnd: [209]
06/14/2023 23:02:14: {SystemGo} RoundEvent starts with round = 209. SignalTime was 06/14/2023 23:02:14
> 06/14/2023 23:02:14: {SystemGo} RoundEvent ends with round = 210. SignalTime was 06/14/2023 23:02:14,
this RoundEvent took 00:00:00.0032490.
...
And it just keeps going; initially, the log messages seem to have resolved the issue, which is unfortunate. On a later run, however, I see an issue:
06/15/2023 00:37:17: {SystemGo} Starting main game loop.
06/15/2023 00:37:22: {SystemGo} RoundEvent starts with round = 0. SignalTime was 06/15/2023 00:37:22
06/15/2023 00:37:27: {SystemGo} RoundEvent starts with round = 1. SignalTime was 06/15/2023 00:37:27
Five seconds into round 0, before it ends (we do a large write to the database in the first round,
which takes more than five sconds), we see round 1 starting. The RoundEvent
being called when it
is already in progress seems like the behavior to expect when
Timer.AutoReset
is set to true
, but I’m not sure why I’m not seeing it all of the time.
One solution is to set Timer.AutoReset
to false
and restart the timer manually when the event
is done. At the end of the RoundEvent
we’ll set the timer’s interval based on how much time we’ve
already taken. Let’s try that:
|
|
|
|
|
|
This appears to work: the closest the RoundEvent
s come is every five seconds, and any slowness
just delays the next one. On the production database, the server survives 1000+ rounds with no
freezing.
One Timer to Rule Them All #
Since all of the other timers are multiples of the main round timer, why are they all firing separately? I have to assume this is for performance: having these many timer events fire separately, they each get a separate thread out of the thread pool and run somewhat concurrently. Massive credit due to the original authors: it worked well, and the server was very reliable given all of the moving parts.
However, it looks like my transition to Mono has surfaced some differing timer behavior. Also, I’d
like to be able to tweak the round timing a bit more, for testing and simulation purposes. So let’s
have the RoundEvent
manually call each of the other timer events instead:
|
|
Tripping Over a Wizard Eye #
The code that calls the NPC’s NPCEvent
each round looks kind of different; what’s up?
In testing these changes, I ran into a few failures. Some tests take into account the healing a character gets every three rounds; the “third round” has shifted, so these needed to be tweaked. More interestingly, the two tests that involve the spell “Wizard Eye” were failing badly.
|
|
## Wizard_Eye
Scenario: Wizard Eye potion results in OOBE # features/player_effects_2.feature:449
Given I use the "minimal" database as-is # features/steps/server_steps.rb:3
And I add player and character "TestHS01" # features/steps/account_steps.rb:12
And I create an "Wizard_Eye" potion of amount "0" and duration "2" # features/steps/character_steps.rb:46
And I put the item in the character's left hand # features/steps/character_steps.rb:59
And I start the server and play # features/steps/server_steps.rb:45
When I open and drink "bottle" # features/steps/character_steps.rb:81
Then I saw myself # features/steps/character_steps.rb:448
And I see my doppelganger in a trance # features/steps/character_steps.rb:458
Net::ReadTimeout with "timed out while waiting for more data" (Net::ReadTimeout)
./features/lib/misc_helper.rb:110:in `telnet_command'
./features/steps/character_steps.rb:460:in `/^I see my doppelganger in a trance$/'
features/player_effects_2.feature:457:in `I see my doppelganger in a trance'
And I rest # features/steps/character_steps.rb:394
And I did not see myself # features/steps/character_steps.rb:453
Scenario: Wizard Eye potion effect visible by others # features/player_effects_2.feature:461
Given I use the "minimal" database as-is # features/steps/server_steps.rb:3
And I add player and character "TestHS01" # features/steps/account_steps.rb:12
And I create a "Wizard_Eye" potion of amount "0" and duration "50" # features/steps/character_steps.rb:46
And I put the item in the character's left hand # features/steps/character_steps.rb:59
And I start the server and play # features/steps/server_steps.rb:45
When I open and drink "bottle" # features/steps/character_steps.rb:81
And I add player and character "TestHS02" # features/steps/account_steps.rb:12
And I log on as "TestHS02" # features/steps/login_steps.rb:201
And I enter the game # features/steps/login_steps.rb:277
Net::ReadTimeout with "timed out while waiting for more data" (Net::ReadTimeout)
./features/lib/misc_helper.rb:110:in `telnet_command'
./features/steps/login_steps.rb:278:in `/^I enter the game$/'
features/player_effects_2.feature:470:in `I enter the game'
Then I saw a "toad" # features/steps/character_steps.rb:472
And I see "TestHS01_name" in a trance # features/steps/character_steps.rb:465
Failing Scenarios:
cucumber features/player_effects_2.feature:449 # Scenario: Wizard Eye potion results in OOBE
cucumber features/player_effects_2.feature:461 # Scenario: Wizard Eye potion effect visible by others
Net::ReadTimeout
? Sounds like a crash of some sort, but nothing in the logs and the server
executable keeps running. And why only in these two tests?
As you may guess from the above final code, Wizard Eye leaves us with a
member of the NPC list that’s not actually an NPC; so casting it to NPC
to call NPCEvent()
was
throwing an exception. Since the exception was in the timer event, not on the main thread, it
wasn’t caught. And since it happened before the timer had been reset, the timer never fired again
(hence no response to the test user).
So. Some danger in not using AutoReset
if the timer event is able to crash quietly. And what I’ve
done by checking for non-NPC NPCs is just a bandaid for these two tests. We should handle
exceptions in the one master timer event more gracefully.
Good Enough? #
124 scenarios (124 passed)
1242 steps (1242 passed)
111m55.772s
Yes, but kind of slow.
Do You Want to Go Faster? #
There is one question that every engineer attached to Test or Ops has heard time and time again: “How can we make the tests faster? (but still reliable and correct, of course.) (and with the same resources or less, of course.)” Sometimes it’s possible, sometimes it takes a lot of work, and sometimes it requires changes in the product code.
Whenever considering putting a “test mode” in the product, a lot of caution is called for. It becomes a question of how useful the testing is, given that the code paths being tested are not necessarily those exercised in production. In the past I’ve recommended using such a mode as an initial screening step, especially when test resources are scarce, but using a full production-mode test run for any release.
Turbo Rounds Enabled #
In this case, we’ll implement a speedup that doesn’t rely on communicating that the server is being tested. Regular players will see the same speedup under the right circumstances. Let’s try:
- If there is time left before the round timer would increment,
- and all effects and NPCs have been processed,
- and all players have entered their commands,
- skip to the next round.
Simple enough to describe, with a few interesting details that come up during implementation. Here’s the code I added:
|
|
On each run through the main game loop, after all of the processing, we run through the list of theoretical PCs.
- If it’s really a PC,
- and they’re in the game (
PLAYING
) rather than at a menu or chatting, - and they either:
- have not yet given a command (empty queue, current command weight is 0),
- are Feared (and therefore unable to give commands)
- or are Stunned (ditto)
- then we need to continue to wait for the round timer.
Otherwise, we fast-forward the round timer interval to 1ms. If we’re between round events, it’ll
trigger as soon as this thread hits Sleep()
. If we’re in mid-event, the timer is disabled and
this will have no effect.
Does it work?
data:image/s3,"s3://crabby-images/d70a5/d70a5c41c68f8ea60132ed2e4f77e61eb92c6ded" alt="Petting a dog - original speed"
data:image/s3,"s3://crabby-images/5be29/5be29d34eec3f4fc87e5a618a6e1af444fda64dc" alt="Petting a dog - new speed"
Looks good. I’d bet all those who paid by the minute to play would’ve appreciated it.
Simpler Inter-Test Resets #
Now that we’re in easily-refreshed containers, we don’t really need to clean up after individual tests; we can just bounce the database and/or game server.
|
|
For tests that manually control the execution of the game server, we use @kill_server_after
to
bring up the containers before each test, and take them down and clear out the bound volume after.
For tests that expect the server to be running, we use @reset_server
instead. We were able to
retire such hooks a @restore_config_file
, @db_cleanup
, and @clear_player_effects
.
It’s certainly possible that the “cleaning up” between tests was a little faster, depending on the speed of database calls compared to Docker actions; but the “clean slate” approach certainly feels safer.
Better and Faster Now? #
124 scenarios (1 flaky, 123 passed)
1257 steps (1 failed, 9 skipped, 1247 passed)
83m20.542s
Hmm, a bit faster but now one’s flaky. Need to look at that. Tomorrow.
One More Thing #
Speaking of timing, let’s look at the daily and lunar cycles.
|
|
|
|
|
|
Four moon phases * 90 minutes? 6 hours.
Seven days of the week * four times of day * 30 minutes? 14 hours.
Looks like there’ll need to be some way to set these values in order to test their effects, but just for fun let’s bypass all the timers, scale down the database output, and see how fast we can get through them:
06/15/2023 14:46:30: ShiftDailyCycle: Round 360, Daily cycle is now Afternoon, current day is now Nanna.
06/15/2023 14:48:49: ShiftDailyCycle: Round 720, Daily cycle is now Evening, current day is now Nanna.
06/15/2023 14:51:03: ShiftDailyCycle: Round 1080, Daily cycle is now Night, current day is now Nanna.
06/15/2023 14:53:17: ShiftDailyCycle: Round 1440, Daily cycle is now Morning, current day is now Enki.
06/15/2023 14:53:17: ShiftLunarCycle: Round 1440, Lunar cycle is now Waxing_Crescent, current day is now Enki.
06/15/2023 14:55:32: ShiftDailyCycle: Round 1800, Daily cycle is now Afternoon, current day is now Enki.
06/15/2023 14:57:48: ShiftDailyCycle: Round 2160, Daily cycle is now Evening, current day is now Enki.
06/15/2023 15:00:03: ShiftDailyCycle: Round 2520, Daily cycle is now Night, current day is now Enki.
06/15/2023 15:02:18: ShiftDailyCycle: Round 2880, Daily cycle is now Morning, current day is now Inanna.
06/15/2023 15:02:18: ShiftLunarCycle: Round 2880, Lunar cycle is now Waning_Crescent, current day is now Inanna.
06/15/2023 15:04:33: ShiftDailyCycle: Round 3240, Daily cycle is now Afternoon, current day is now Inanna.
06/15/2023 15:06:49: ShiftDailyCycle: Round 3600, Daily cycle is now Evening, current day is now Inanna.
06/15/2023 15:09:04: ShiftDailyCycle: Round 3960, Daily cycle is now Night, current day is now Inanna.
06/15/2023 15:11:20: ShiftDailyCycle: Round 4320, Daily cycle is now Morning, current day is now Utu.
06/15/2023 15:11:20: ShiftLunarCycle: Round 4320, Lunar cycle is now Full, current day is now Utu.
06/15/2023 15:13:34: ShiftDailyCycle: Round 4680, Daily cycle is now Afternoon, current day is now Utu.
06/15/2023 15:15:49: ShiftDailyCycle: Round 5040, Daily cycle is now Evening, current day is now Utu.
06/15/2023 15:18:05: ShiftDailyCycle: Round 5400, Daily cycle is now Night, current day is now Utu.
06/15/2023 15:20:22: ShiftDailyCycle: Round 5760, Daily cycle is now Morning, current day is now Gugalanna.
06/15/2023 15:20:22: ShiftLunarCycle: Round 5760, Lunar cycle is now New, current day is now Gugalanna.
06/15/2023 15:22:38: ShiftDailyCycle: Round 6120, Daily cycle is now Afternoon, current day is now Gugalanna.
06/15/2023 15:24:54: ShiftDailyCycle: Round 6480, Daily cycle is now Evening, current day is now Gugalanna.
06/15/2023 15:27:10: ShiftDailyCycle: Round 6840, Daily cycle is now Night, current day is now Gugalanna.
06/15/2023 15:29:28: ShiftDailyCycle: Round 7200, Daily cycle is now Morning, current day is now Enlil.
06/15/2023 15:29:28: ShiftLunarCycle: Round 7200, Lunar cycle is now Waxing_Crescent, current day is now Enlil.
06/15/2023 15:31:45: ShiftDailyCycle: Round 7560, Daily cycle is now Afternoon, current day is now Enlil.
06/15/2023 15:34:01: ShiftDailyCycle: Round 7920, Daily cycle is now Evening, current day is now Enlil.
06/15/2023 15:36:18: ShiftDailyCycle: Round 8280, Daily cycle is now Night, current day is now Enlil.
06/15/2023 15:38:35: ShiftDailyCycle: Round 8640, Daily cycle is now Morning, current day is now Nanna.
A week in less than an hour isn’t bad, but still not fast enough for a speedy test cycle.