Job Control/Scheduling

Job control (finding work for an idle hearthling to do) is one of the largest performance problems in games like Stonehearth. The basic problem can be described as; given a huge list of tiny jobs (everything that needs to be crafted, every item that needs to be hauled, every individual block waiting to be mined, etc), calculate a score for each job, and find the job with the highest (“best”) score. For small kingdoms (small number of hearthlings, small number of jobs) it seems easy; but as the kingdom grows (large number of hearthlings , large number of jobs) it becomes a nightmare.

List Splitting by Job Type

The first (and most obvious) step for performance is to use many lists; rather than just having a single “master list”.

You’d have a list of jobs for backsmiths, a list of jobs for builders, a list of jobs for hauling; and a hearthling capable of all 3 job types would have a “job type order” - e.g. you’d search the blacksmith job list first; and if you didn’t find a job you’d search the building job list; and if you still didn’t find a job you’d search the hauling job list.

This gives around 15 different job lists.

For some professions there are multiple jobs. For example, a herbalist has craft jobs (making potions, bandages) and healing jobs; a shepherd has animal feeding jobs and butchering jobs; an engineer has craft jobs (gears, turrets) and repair jobs; a farmer has tilling, planting and harvesting jobs; and so on. These would be treated as separate job types.

This gives around 20 different job lists.

List Splitting by Other Criteria

For hauling and mining jobs the list can become huge (making it more expensive to find a hauling and mining job). To reduce this these job types can be split further. For example, you could use “distance from town center and “approx. town radius””, and have 2 job lists for hauling - one for “near hauling” and one for “far hauling”. Haulers would search the “far hauling” list first, then search the “near hauling” list. For mining it’d be the same but opposite (“near mining” then “far mining”).

Building can be split into block placement, scaffold construction, scaffold destruction and furniture/window placement.

This gives around 30 different job lists.

For example; a farmer might check the “harvesting” list, then the “planting” list, then the “tilling” list, then the “far hauling” list, then the “near hauling” list.

Dependence Peeling

Some jobs have dependencies that you know can’t be satisfied (e.g. require items that don’t exist in the kingdom’s inventory). These jobs should be placed in seperate “dependency lists”; with a list for each type of item they’re waiting for. When an item of that type is added to the kingdom’s inventory you’d check the dependency list for that type of item, and move job/s that were waiting for the item and are now possible into the appropriate job list.

For example, you might have a “make steel sword” job and a “make bronze gear” job on a “waiting for metal bars” dependency list. When the blacksmith creates a new metal bar you’d check the “waiting for metal bars” job list to see if something is satisfied. If the “make bronze gear” job is satisfied you’d move the job from the “waiting for metal bars” list to the “engineer crafting” list.

If a job has 2 or more dependencies (e.g. a stew needs meat and vegetables) and both aren’t satisfied (there’s no meat and also no vegatables) then it doesn’t matter if the job goes on the “waiting for meat” list or on a “waiting for vegatables” list. In this case; if it’s on the “waiting for vegetables” list and vegatables are added to the kingdom’s inventory, you’d shift the job to the “waiting for meat” list (and would not shift it to the “cook crafting” list).

While searching for a job and find that something is no longer satisfied (e.g. a “make bronze gear” job, where the was bronze bars initially but they’re used/sold/gone now) you’d move the job from the “engineer crafting” list to the “waiting for metal bars” list.

Job List Processing

Processing a list mostly involves a “for each job in list” loop containing:

  • initial dependency checking (if the job is missing something it depends on - raw materials, empty stockpile space, etc).
  • calculating a score (e.g. distance from hearthling to item, distance from item to stockpile)
  • doing a “if (score > best_score) { best_job = job; best_score = score; }” thing

For performance; any “distance” calculations should not be literal distances (and shouldn’t involve accurate path finding or “sqrt()”). It is enough to assume distance is “sum of deltas” (or “distance = abs(x1 - x2) + abs(y1 - y2) + abs(z1 - z2)”). Once a “best” job is found, you’d have to double check (using accurate/expensive path finding) to determine if the item/s can actually be reached; and if the items can’t be reached you go back to the process the list again. For this reason the scores for each job should be stored in the job’s entry (so that if you need to retry because an item wasn’t reachable you can do so without recalculating each job’s score).

If a job is found, the items needed for the job need to be marked (or locked) to prevent the items being used for something else (e.g. to prevent the ingredients being eaten while a cook is trying to make a meaty stew); and the job needs to be removed from the job list. This needs to be thread safe (all items needed get locked and the job is removed from the list; or none of the items needed are locked and the job remains in a list). If you fail to acquire the lock for an item (e.g. because someone ate it, it got sold, etc) then you have to unlock the items, put the job on a “dependency list”, then go back to the process the list again.

Thread Safety

With many lists (more lists that you have CPUs/threads; assuming a “one thread per CPU” model and not something silly) it’s enough to have one lock per list; because the chance of 2 threads fighting for the same lock (lock contention) is negligable.

For deadlock prevention you need a lock order. The problem here is mostly for moving jobs from one list to another (where you need 2 locks at once). If one thread locks “list A” then tries to get the lock for “list B”, and a different thread locks “list B” then tries to get the lock for “list A”, then both threads can’t get the locks they need and wait for each other forever. Essentially you need to make sure that locks are acquired in the correct order - e.g. if both threads try to acquire the lock for “list A” and then try to acquire the lock for “list B”, then there’s no problem.

The other problem is job cancellation. If a job is cancelled, you want to avoid finding and then locking, and then modifying whichever list the job happens to be on. The way to do this is to only set a “job cancelled” flag in the job, and remove the job from whichever list it’s on when that list is being processed. Note that C++ has “atomics” so it should be trivial to set a “cancelled” flag without acquiring any lock. The “cancelled” flag would need to be checked when processing job lists, and before performing each step of the job.

List Data Structures

Most languages have support for “collections” (lists, etc) which make it easy to write and maintain code. An example of this is “template < class T, class Alloc = allocator > class list;” in C++. These are almost never acceptable and shouldn’t be used. To understand why, you need to understand how CPUs and caches work, and how “collections” are typically implemented.

As far as the CPU is concerned, all memory is split into “cache lines” (note: for 80x86 a cache line is a contiguous 64 byte block of memory). The CPU only really works on whole cache lines - if you write one byte, the CPU fetches the whole (64 byte) cache line, modifies one byte, then (later) writes that whole cache line back to RAM. How long it takes to fetch a cache line depends on where it is. If it’s in the CPU’s “L1 data cache” it might take 4 cycles to access it, if it’s in the CPU’s “L2 data cache” it might take 40 cycles to acccess it, and if it’s not in any cache it might take 400 cycles to access it. Cache line fetches lead to pipeline stalls (the CPU being unable to do anything useful because it’s waiting for data to arrive).

To get maximum performance you need to minimise the chance of cache misses, because fetching data from RAM is several orders of magnitude slower. There are 2 ways to minimise cache misses: prefetching (get the data into cache before it’s actually needed, so its in cache when it is needed) and “locality” (put all data that will be needed at the same/similar time into the least cache lines). Note: “locality” is more than just caches, and effects paging (and TLB misses) too, but the idea remains the same - keep pieces of data that will be needed at the same/similar time close to each other.

Now consider a typical linked list. Each element in the list is allocated without regard to the location of other items in the list, so you can kiss any hope of “locatility” goodbye before you even start - the list is going to be scattered all over the place. Worse, when iterating the list the CPU can’t prefetch the next entry because the address of the next entry is stored in the previous entry (in the previous entry’s “next” field). If that’s not bad enough, C++ adds an extra level of indirection - rather than having a linked list of data structures, you end up with a linked list of “list blocks that point to data structures” which adds a whole extra layer of locality and prefetching problems. The end result is a massive performance disaster. By doing it right you can easily beat the “list” class provided by C++ by an order of magnitude (and beat whatever LUA provides by many orders of magnitude).

So, how do you do it right?

The CPU is extremely good at nice sequential accesses - it even has (multiple) “hardware prefetchers” built in for this access pattern. If you use “array of structures” to store your list, you get the best possible locality and the best possible prefetching (and the best performance). What you want is to pack the most important information into the structure but no more, so you end up with the highest number of structures packed into each cache line (e.g. try to get 32-byte structures with 2 structures/jobs per cache line). Any “less important” excess information (information that isn’t needed for some jobs) can be shoved off into a secondary structure if necessary.

The way to do it is to determine the maximum amount of entries you’d want store, then (statically if you can, e.g. “myStruct myJobPool[MAX_ENTRIES];”) allocate the array, and implement a ring-buffer by having “first used entry” and “last used entry” variables. To iterate you do something like a “entry = first; while(entry != last) { …; entry++; if(entry > MAX_ENTRIES) entry = 0; }” loop.

When entries are removed you clear an “in use” flag in the entry (note: not the same as the “cancelled” flag mentioned about - a job can be in progress, cancelled but still in use because the hearthling doing the job hasn’t realised it’s been cancelled yet), and if it was the first entry you search for the next highest “not deleted yet” entry (until “first == last”, indicating the list is now empty). When entries are added you store them at “last” if you can; but if you can’t (there’s no room) you search starting from the entry after “first” for a “not in use” entry. If it’s possible that all entries are in use (you’re trying to store more than “MAX_ENTRIES” entries, likely because there’s no way to guess how many jobs the player might have at one time) then you need a secondary “overflow list” (which can be something like a linked list because it’s only going to be used in extreme conditions). Note that these searches and skipping “not in use” entries sound a little expensive at first glance, but you can check many entries in less time than a single cache miss would cost.

The other thing this does is (almost) completely remove the cost of memory management (new, delete) when creating or removing jobs, because the memory is preallocated (as part of the array itself) and recycled (and never actually freed).

Note that (for C++) the “vector” class solves some of the problems that the “list” class (and “forward_lists” and “deques”) have (vectors use sequential accesses for iterating over “pointers to what you want”) but creates other problems (resizing, removing items in the middle, etc) and doesn’t fix the “extra level of indirection” problem or the “locality” problem or the memory management problem.

4 Likes

Although we use a different representation for the hearthling ai than you discuss here, most of the principles you mention are present in one form or another when evaluating what a hearthling should do. (Internally we use a tree with priorities at the nodes and prune subtree’s whose priorities cannot pre-empt higher valued nodes. Each hearthling action is composed of sub-actions that are only evaluated if the prerequisites of the prior nodes have been met.)

One exception is that we have not scaled out threading (which could help pathfinding quite a bit), and that we use LUA coroutines for cooperative multitasking which only runs on a single core.

There is still a lot of optimization work to be done, and as we add more features, we need to take a performance pass every few alphas to measure and optimize the behavior. First we optimize the algorithms, then the lower-level implementation, and then finally the memory allocation and cache-line behavior. Doing this takes time and we try to strike a balance between features, content, stability, and performance.

6 Likes

Oh wait, so is the game not making use of all the computer’s cores, then? So we should expect a massive performance boost once that’s worked out?

Stonehearth currently can make use of about 3-4 cores. The major threads are server, client, and UI. Lua itself won’t get true multithreading, but the primitives it calls (pathfinding etc) could use more threads to increase computational throughput.

1 Like

Oh, okay. It sounds like there isn’t any significant risk of slowdown from the one Lua thread falling behind the multiple C++ threads it’s calling, since the C++ code has so much more to do?

If a lua method is slow, we can usually move the computationally intensive part into C++.

1 Like

Ah - Thanks :slight_smile:

One exception is that we have not scaled out threading (which could help pathfinding quite a bit), and that we use LUA coroutines for cooperative multitasking which only runs on a single core.

I assumed the game (server) was properly multi-threaded (e.g. with a “one worker thread per logical CPU that pulls work from a global queue” approach), so my previous estimates (200+ hearthlings on a modern “quad-core with hyper-threading (8 logical CPUs)” system) were wrong.by a factor of about 5.

Taking that into account; the game is actually performing better than I’d expect. :relaxed:

1 Like

Oh wait, so is the game not making use of all the computer’s cores, then? So we should expect a massive performance boost once that’s worked out?

It depends on many things; like what your hardware is.

For an old dual-core laptop I wouldn’t expect multi-threading to make much/any difference.

What I like to do if I can (e.g. with games like Minecraft, StarMade, etc) is play single-player with the game’s server running on one computer and game’s client running on a different computer (and gigabit ethernet between); and in that case I’d expect significant improvements.