Instantiation
- the heart of consciousness
Possibly
the greatest software challenge for AI will be the instantiation
engine. It must reverse a 2D bitmap render of vision (or the
input from any other modality), recognizing the environment and
objects from internal memory correlates (concepts) in order to
recreate the virtual 3D scene. There are really only a few common classes
of environment sets - countryside, office, kitchen, work bench,
shop, theatre, plane etc. If any environment match can be
found, a fully instantiated scene framework will be ready
to go, leaving only image scale, detail and perspective to
be resolved.
A
lump of clay weighing a few pounds can instantiate a greater variety
of forms than there are atoms in the universe. But
only a tiny subset of those forms will have any meaning attached
and be associated with any behaviors - cat, fridge, airplane
etc. The human mind is able to, with only a few pounds of
meat, instantiate form and behavior from novel 2D vision scenes
at the rate of about one object per second. Considering how
many 3D pattern matches must be made against our library
of known objects, this is quite an achievement. In most circumstances,
significant mystery can remain within a scene (bitmap areas
without instantiation), so long as the major items are decoded
out, such as environments, significant life forms or emotionally
charged objects.
Possibly,
with unlimited time and processing power, artificial instantiation
could be achieved through 3D scene estimates, rendered down
to 2D and then compared with the bitmap input. Corrective
feedback cycles could iteratively discover the light sources
(from radiosity and shadow effects) and camera perspective
(from room edge key points or with lock-in provided from a
single object discovery). But it should be possible to design
faster search algorithms than such brute force trials, perhaps
by comparing pre-rendered trial object 'icons' to the 2D scene
or, in reverse, by extracting edge patterns from the 2D image,
normalizing scale and tossing those into a search path through
memory to catch shape and/or surface pattern matches.
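
As a toy illustration of the render-and-compare loop described above, the following Python sketch recovers a camera pose by corrective feedback. Everything in it is an assumption made for illustration: the object is a unit cube, the only unknowns are a yaw angle and a camera distance, the 2D input is a list of projected corner points with known correspondences rather than a real bitmap, and light sources are ignored. The names CUBE, render, mismatch and instantiate are hypothetical.

```python
import math

# Hypothetical toy object: the eight corners of a unit cube centred on the origin.
CUBE = [(x, y, z) for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]

def render(points, yaw, distance):
    """Rotate the points about the vertical axis, then pinhole-project them to 2D."""
    c, s = math.cos(yaw), math.sin(yaw)
    image = []
    for x, y, z in points:
        xr = c * x + s * z
        zr = -s * x + c * z + distance  # depth in front of the camera
        image.append((xr / zr, y / zr))
    return image

def mismatch(a, b):
    """Sum of squared 2D distances between corresponding points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))

def instantiate(observed, yaw=0.0, distance=5.0, step=0.5, rounds=100):
    """Corrective feedback: nudge the 3D estimate, re-render, keep what fits better."""
    best = mismatch(render(CUBE, yaw, distance), observed)
    for _ in range(rounds):
        improved = False
        for dyaw, ddist in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand_yaw, cand_dist = yaw + dyaw, distance + ddist
            if cand_dist <= 1.0:  # keep the camera outside the cube
                continue
            err = mismatch(render(CUBE, cand_yaw, cand_dist), observed)
            if err < best:
                yaw, distance, best, improved = cand_yaw, cand_dist, err, True
        if not improved:
            step /= 2  # no nudge helped, so refine the search
    return yaw, distance, best

# Usage: pretend the "bitmap" came from a camera at yaw 0.3 rad, distance 3.0.
observed = render(CUBE, 0.3, 3.0)
yaw, distance, err = instantiate(observed)
```

Even this two-parameter search re-renders the scene hundreds of times, which gives a feel for why brute force trials over full scenes, objects and lighting would be so expensive.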
The
challenge is to design a 3D object description language that
can be interrogated rapidly and that is based on fuzzy search
criteria. You cannot use a polling search metaphor against
a million images, each of a thousand orientations; you have
to use an 'interrupt' or 'vector' search metaphor. Human vision
is based on the identification of features rather than exact
form, thus a violin twisted around a pole can still be recognized;
or a clock printed on a crumpled table cloth. The challenges
of high speed instantiation make the decisions about where to
focus attention (on the motion of a cat, or on following the
eyes of a human) seem almost trivial by comparison.
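
One possible reading of the 'vector' search metaphor is an inverted index, in which each detected feature points straight at the stored objects that contain it, so that no stored object is ever polled. The Python sketch below is only an illustration under that assumption; the feature labels, the LIBRARY contents, the threshold and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical object library: each object is described by a set of features,
# not by an exact form.
LIBRARY = {
    "violin": {"f-holes", "scroll", "four strings", "waisted body", "bridge"},
    "guitar": {"sound hole", "six strings", "waisted body", "frets", "bridge"},
    "clock": {"round face", "two hands", "numerals"},
}

def build_index(library):
    """Map every feature to the set of objects that possess it."""
    index = defaultdict(set)
    for name, features in library.items():
        for feature in features:
            index[feature].add(name)
    return index

def recognise(observed, index, library, threshold=0.5):
    """Let each observed feature vote for the objects it points at."""
    votes = Counter()
    for feature in observed:
        for name in index.get(feature, ()):
            votes[name] += 1
    # Fuzzy match: score is the fraction of an object's features that were seen.
    scored = [(count / len(library[name]), name) for name, count in votes.items()]
    return sorted(((s, n) for s, n in scored if s >= threshold), reverse=True)

# Usage: a distorted violin still shows three of its five features.
index = build_index(LIBRARY)
matches = recognise({"f-holes", "scroll", "four strings"}, index, LIBRARY)
```

The cost of a lookup depends on the number of features seen, not on the size of the library, which is the property the polling metaphor lacks.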
Just
as a human is built upon autonomous biological layers, cognition
has its own autonomous layers, for instance instantiation,
morphing and tweening (the construction of in-between time
frames during simulation).
When we script a human actor entering a room, the motion tweens
do not need to be consciously re-calculated; their construction
is either automatically generated or already stored in memory
as an animated motion tween. Only the environment, context
and emotional attitude need to be scripted in order to direct
simulations.

Rendering
is the translation of 3D scenes to 2D bitmaps. Instantiation
is the reverse, the creation of 3D scenes from 2D bitmaps.
Consider a neuron array metaphor, in which a projected image triggers
firing along an axon. If those neurons each have, say, 255 axons
(connections) propagating out, then within that tangle all possible
orientations and translations of any 3D object are spatially
encoded. The decoding out of that data could be achieved
from the propagating input wave function through time. For
example, suppose each element on one of two opposing faces is
connected to every element on the other face, i.e. each
input pixel has 255 vectors spreading out. If it took one
hour for the signal of a firing neuron to travel along the
axons between the surfaces, and you divided that time period
up into small enough units, at any instant in time, a set
of those vectors from the expanding pixel wave fronts will
be optimally aligned to a specific translation of the projected
object. Were those vectors known (trained), and linked together,
the full 3D translation could in theory be described by those
lateral connection sets. If those connection channels were
two way, the objects could either be instantiated (identified)
from input modality patterns, or in reverse, be used to trigger
the same visual imagery (memorized experience) but directly
from the linked network patterns, themselves connected to
similar and associated modality patterns of visual and oral
language tags, or even taste, smell and touch attributes.
Modality
flows, whether from sight, sound, touch, taste or smell, manifest
in the brain as parallel analog data channels of specific
and appropriate frequency, phase and dynamic (amplitude) ranges.
The same principle of instantiation applies equally to all
these analog sensory data sets; with receiving neural arrays,
optimally tuned to the character of each input class. For
example the sound of a word or event, as with vision, will
enter the neural array as a parallel two-dimensional analog
data wave-front of frequency and phase channels or 'aural
pixels', extending into the neural array as a third dimension
through time. Cross connections linking spatial patterns will
again identify those with the closest correlation to existing
memory traces. In this way, as for vision, if only part of
a word is heard, in any tone or accent, or even masked by
other sounds, there will be sufficient signature correlation
to make reasonable probabilistic guesses for subsequent wider
context simulation trials and grading. These data signatures,
being now instantiated, are thus linked to the universal environment
map of objects and environments. Otherwise, the inputs would
merely remain unidentified sounds bearing only fleeting similarities
to known aural traces.
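
The idea of guessing a word from a partial signature can be sketched in Python as a correlation of a heard fragment against stored traces. This is a deliberately crude stand-in for the neural array described above: the three frequency channels, the TRACES values and the function names are invented for illustration, and cosine similarity is simply one convenient correlation measure.

```python
import math

# Hypothetical memory traces: each word is a sequence of time frames, and each
# frame holds the energy in three made-up frequency channels ('aural pixels').
TRACES = {
    "cat": [(1, 0, 0), (0, 1, 0), (0, 0, 1)],
    "cap": [(1, 0, 0), (0, 1, 0), (1, 1, 0)],
    "dog": [(0, 0, 1), (1, 0, 1), (0, 1, 1)],
}

def flatten(frames):
    """Lay the channel-by-time frames out as one flat vector."""
    return [value for frame in frames for value in frame]

def cosine(a, b):
    """Normalized correlation between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_alignment(fragment, trace):
    """Slide the fragment along the trace and keep the best correlation."""
    width = len(fragment)
    if width > len(trace):
        return 0.0
    return max(
        cosine(flatten(fragment), flatten(trace[i:i + width]))
        for i in range(len(trace) - width + 1)
    )

def guess(fragment, traces=TRACES):
    """Rank the known words by how well the fragment correlates with each."""
    scores = {word: best_alignment(fragment, trace) for word, trace in traces.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Usage: only the last two frames of a word were heard, with some noise mixed in.
ranking = guess([(0.1, 0.9, 0.0), (0.2, 0.1, 0.8)])
```

The ranked list is the kind of probabilistic guess that a wider context simulation could then grade.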
Instantiation
processing from sensory modalities is automatic and unconscious;
there is little mental effort involved, and further, not only
are the 3D objects instantiated, but so are any associated
animation tweens (object behaviors). Just as bitmaps link
to 3D objects, so those 3D objects link to form animated behaviors,
either as internal memorized tweens or newly constructed object
motion or morph tweens.
Take
a mouse object at time t1 and a teaspoon at t2, place them
in the same spatial location and connect their surfaces together
with orthogonal vector lines. Divide those lines into equal
'time' segments and render a perspective to create frames
for the movie script. This process is known as 'morph tweening',
and will be one of the core visual translation tools
necessary for AI both to interpret modality flow and to create
novel content. During any visual thought process,
creating smooth in-between renders between distant or disparate
objects in time and/or space will be crucial.
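
The morph tween described above can be sketched in a few lines of Python. The sketch assumes that the two objects have already been reduced to equal-length lists of paired surface points (finding that pairing is the hard part and is not shown), that the connecting lines are straight, and that the render is a simple pinhole projection. The names morph_tween and project are hypothetical.

```python
def morph_tween(source, target, frames):
    """Return frames + 1 point sets stepping from source (t1) to target (t2)."""
    if len(source) != len(target):
        raise ValueError("source and target need the same number of surface points")
    if frames < 1:
        raise ValueError("frames must be at least 1")
    sequence = []
    for k in range(frames + 1):
        t = k / frames  # equal 'time' segments along each connecting line
        sequence.append([
            tuple(a + t * (b - a) for a, b in zip(p, q))
            for p, q in zip(source, target)
        ])
    return sequence

def project(points, distance=4.0):
    """Render a perspective view: divide x and y by depth from the camera.

    Assumes every point satisfies z + distance > 0.
    """
    return [(x / (z + distance), y / (z + distance)) for x, y, z in points]

# Usage: two stand-in 'surfaces' of two points each, tweened over four segments.
mouse = [(0, 0, 0), (1, 0, 0)]
teaspoon = [(0, 2, 0), (1, 2, 2)]
movie = [project(frame) for frame in morph_tween(mouse, teaspoon, 4)]
```

Each entry of movie is one 2D frame of the script; more segments give a smoother morph.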
Even
apart from AI, the commercial spin-offs from an instantiation
engine will be enormous. To start with, consider the possible
re-animation of all historic language documents and visual
2D media, to create a cornucopia of rich, new, flexible animatable
content.
