I am cross-posting this to Mud-Dev... this is part of a discussion on
Devmud about how different parts of the mud should talk to each other.
Devmud's design is modular, and it would be great to have a core set of
modules that are generic and re-usable for multiple purposes.
I realize a lot of you aren't interested in Devmud (or else you would
probably have seen this a few minutes ago :) but a lot of the issues that
have come up are probably of "general" interest to developing/enhancing any
kind of mud server. Plus, chances are good that some of you have gone
through a similar design process and faced similar decisions... hopefully
you have some feedback or stories to share.
Anyway, the context is "trying to develop an API for a generic database
module". Earlier I had suggested working on a generic API rather than
specific code, so that there could later be multiple interchangeable
modules that all use the same interface... meaning more flexibility for the
end user. There may eventually come a time when a devmud-derived server
offers both a "free" database and a "high-end" plug in to a
high-performance third party DB.
In this note:
- discussion about how to pass values between client and db modules (long)
- discussion about indexing and sort/traverse/search, why you might need it
Chris Gray wrote:
>[Greg's suggested interface looks fine to me.]
Thanks. Though as I thought about it later, I started to toss around other
ideas... specifically the idea that data should be passed in as a large
block, like a struct. This is great for dbm type databases, but other
table/field-oriented db's will need to break thing up into individual
fields, especially if you want indexing (more on that below).
So, I'm now working on an alternate proposal, that would break records up
into fields that the caller defines ahead of time. You could use the
interface for storing a single binary, even the in-memory representation of
a struct, but in order to properly sort and traverse/search the data, the
database needs to either:
learn about the structure of the data to know which bytes to index
callback to a client function to deconstruct it
or, just have the client hand in pre-separated fields in the first place :)
The revised proposal would have groups of functions to add new fields to a
database, to get and set individual fields, or get/set multiple fields at a
time (call them "requests" maybe)... this implies that we have a way to
call a function and hand it multiple values, or get back multiple values;
meaning that a simple struct would not serve the purpose.
This leads into the discussion of how to pass values back and forth...
>[Greg Connor:]
>> CG's previously posted implementation loads
>> things to an in-memory cache, and gives you back a pointer, but this may
>> cause problems for threading/concurrency, if it is being changed by another
>> thread, or if it is flushed from the cache before you are done... I am open
>> to suggestions here.)
>
Chris Gray wrote:
>Agreed - my model is not suitable for a multi-threaded environment. Also,
>the pointer that the client gets is right into the DB cache. Any errors
>by that client code, in terms of writing beyond the length of the DB
>entry, can cause very hard to find problems. I'm not sure of the semantics
>of Greg's 'read_item' entry. Does it return a pointer to the data, or
>must the user pass a pointer to a buffer to fill in? If the former, then
>the memory-stomping sensitivity still exists. If the latter, then a
>mechanism is needed to signal an insufficient buffer size.
Well, I hadn't quite worked that part out yet :) There is a design
decision lurking here, which is basically a trade-off between efficiency
(speed or space optimizations) and making double-sure that everyone's code
"does the right thing" to avoid memory leaks, writing out-of-bounds memory
or dereferencing nil. Laying down some rules will make the API more
complicated, but I think it can be done with an eye toward minimizing
performance penalty.
Just in terms of how to pass a "block" of data around, some alternatives
come to mind.
- Client pre-allocates memory and hands a pointer to buffer to be filled
Client can use a fixed-size local/stack buffer, or malloc/free its own mem
Not great for unknown-size data, buffer always too big or too small
Buffer too small means that the db function must return an error or
some overflow flag
Better for concurrency/threading, same memory not being used by
multiple clients
- DB Alloc's the mem, client is expected to free it
No too-big/too-small problems, buffer is always the right size
Can't use stack/local mem
Trusting the client to free - encourages leaks
Clients must be robust, not "just bail out" in error conditions, must
free even if something goes wrong
Better for concurrency/threading, same memory not being used by
multiple clients (same)
- DB hands a pointer to "temporary space" that will be recycled later
No leaks - DB does the cleanup later
No way to know when to clean up - can be freed too soon, especially if
whole package is threaded/concurrent
If client keeps using the space, risk of clobbering someone else's
memory or reading wrong results
- DB uses open/close semantics
Structure is loaded up by DB, and not released until client signals it
is done
Can use the same copy from the cache for multiple clients
DB must keep reference count
Chance that client won't signal when done, items can get "lodged" in
the cache
- DB uses "request cookies" and read/write/copy semantics
Instead of just a pointer, the client gets a "cookie" - either a number
that the DB can track back to a specific request, or a pointer-plus-ID-number
If item is still loaded in cache, read/write/copy can just do memory
operations.
If item has been unloaded/flushed, and cookie doesn't "map to" what's
in memory, DB either returns an error result, or forces the item to reload
(but doesn't read or write a buffer that's already freed)
If client wants to pass stuff by reference, it will have to request it
be copied to local storage. That is, to enforce this kind of "electric
fence" you can't just give a pointer to a block; you have to make a copy
for each client... this is an efficiency penalty (usually an extra
in-memory copy).
"Cookies" could be similar to what someone (JC Lawrence?) proposed for
object-ID, like an a ID/creation time pair
I am starting to lean toward the last method... it seems like a natural fit
with the idea of having a cache and serving multiple clients (possibly
multiple threads) from the same pool of resources. I realize there is a
performance penalty, and that it also means the API will be more
complicated. Going in this direction means sacrificing some speed, but
reducing the chance of memory-clobber and possible crashing results. It
also means some extra work on the part of the client-module programmer
up-front, but has a good chance of reducing the work of debugging strange
bugs in the integrated system as a whole later in the development.
In order to make the db re-entrant and thread safe, the DB would have to
make sure that its routines use proper semantics, but if this is all
implemented within the DB module, at least there is a *chance* of making
things multi-thread aware :) Whereas if you hand a simple pointer to an
external routine, you may never be able to get complete control of thread
issues.
Chris Gray wrote:
>A general question on DB's for MUDs: are indexes used, and if so, what
>for? I can see them used to look up player-character names on login.
>What else? Looking up object names doesn't make much sense - you don't
>want to search the whole world for a referent - only the stuff in the
>same "room" or being held - that will be a few orders of magnitude
>smaller in a big MUD. Looking up verbs works if they are not tied to
>base objects and all synonyms/abbreviations are also entered.
Usually when I hear the term "index" (or possibly "key") the image that
comes to mind is "optimization" - usually for searching or sorting.
Searching means that you want to find a record that matches your criteria,
without having to examine *every* record in the table. What you said about
player names is an example... you want to be able to find the record with
that name (string) in it, or quickly determine whether or not a certain
name is already "taken". But there are other examples.
Let's take the other example you gave: you are in a room with five other
objects and you want to search them for verbs. In this case, it doesn't
make sense to have a global list of all verbs - that's kind of backwards,
as you pointed out. You don't want to look for instances of "push", then
parse the list of buttons to see if any are in the room with you. What you
really want is to look at a small number of objects to see if any of them
have a "push" verb.
But even in this case you probably want an "index" - this time, it's an
index by "location". Your mud code may or may not maintain a constant list
of "things that are nearby" - chances are it doesn't. Also, maybe it makes
sense to just look at the room you're in for its "contents" - but
ultimately it boils down to "which five records, out of thousands, have the
same 'location' value as my player?"
"Contents" and "location" are reciprocal, but are also many-to-one. You
usually don't store the ID's of all the "contents" in the database record
with the "room" or "container" - usually you will optimize this lookup by
stashing a First and Next (so you can "walk the list" of contents) or by
using an index on "location" as a key and searching (also so you can "walk
the list" of records meeting your criteria).
There are a fair amount of "many to one" relationships (parent/child,
container/location, owner/property). Some db's I have seen use a kind of
"linked list" for each relationship, where each item knows it's "parent",
but the parent only knows one "child" and each child has a pointer to the
next, and the next, etc. This is really optimized for "walking a list" but
these lists can become inconsistent due to bugs/crashes/etc... if the db is
inconsistent, you get items located in a room that aren't part of its
contents list, and other "impossible" conditions.
With an "index" type of ability, you can search for records with a given
criteria and "walk the list" in the same way, but you save a little
overhead in that each record doesn't need to know its first child/next
child (possibly empty)... you just draw queries from the master list. The
relationship data has to be stored somewhere, but it's a little nicer not
to have to store the "chains" as part of the record itself (and pass it
around in the struct, etc).
There is still a chance that the index will become inconsistent due to bugs
or crashes, but I would rather have the DB developer write and debug this
feature once, and not have to chase down inconsistency problems that are
specific to one specific field or relationship.
Many back-end databases (i.e. third-party software that is not
mud-specific) have "index" ability, so if the database module of the mud is
a pass-through to some back-end, you get this ability at little extra cost.
If some back-ends have it and some don't, and you have to make the
interface "generic" as we are trying to do, you have to either limit
functionality to the "lowest common demoninator" or add new code to make up
for needed features that the back-end doesn't provide. In the case of
indexes, I would guess that it would need to be invented in places where it
doesn't exist, more often that being there and not used.
There are probably other examples of things you want to
sort/traverse/search in the database that I am not thinking of here...
That's all for now. I will continue to work on the generic database API
and post a revised proposal soon. If anyone has additional suggestions
about features they want in a mud database, from the perspective of coding
other modules, please post! I want this database to be useful to the
maximum number of other developers.
gregc