Brent Oster
32 min read · Jan 24, 2021

Artificial General Intelligence — The path to Superintelligence

This article contains a complete description of an approach for building an AGI, covered under Provisional Patent US #63/138,058, filed Jan 15, 2021.

brent.oster@orbai.com

I worked for a decade at NVIDIA as a CUDA Devtech Engineer and Solution Architect, researching deep learning techniques, presenting solutions to customers, and helping implement those solutions. For the past 3 years I have been working on what comes next after DNNs and deep learning. I will cover both, showing why it is very difficult to scale DNNs to AGI, and what a better approach would be.

Check out the effort at http://www.startengine.com/orbai

What we usually think of as Artificial Intelligence (AI) today, when we see human-like robots and holograms in our fiction talking and acting like real people with human-level or even superhuman intelligence and capabilities, is actually called Artificial General Intelligence (AGI), and it does NOT exist anywhere on Earth yet. What we actually have for AI today is much simpler and much narrower Deep Learning (DL), which can only do some very specific tasks better than people. It has fundamental limitations that will not allow it to become AGI, so if AGI is our goal, we need to innovate and come up with better networks and better methods for shaping them into an artificial intelligence.

Let’s do a deep dive into the science and tech of human intelligence and AI:

  1. Where Deep Learning and Reinforcement Learning are today
  2. What their limitations are — what can they do and not do?
  3. The neuroscience of human intelligence
  4. A possible architecture to achieve artificial general intelligence

1) Today’s Deep Learning and Reinforcement Learning

Let me write down some extremely simplistic definitions of what we do have today, and then go on to explain what they are in more detail, where they fall short, and some steps towards creating more fully capable ‘AI’ with new architectures.

Machine Learning — Fitting functions to data, and using the functions to group it or predict things about future data. (Sorry, greatly oversimplified)

Deep Learning — Fitting functions to data as above, where those functions are layers of nodes that are connected (densely or otherwise) to the nodes before and after them, and the parameters being fitted are the weights of those connections.
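To make “fitting functions to data” concrete, here is a minimal sketch (my own toy example, not part of the patent) that fits a cubic polynomial to noisy samples and then uses it to predict values at new points:

```python
# Minimal illustration of "fitting a function to data": least-squares fit of a
# cubic polynomial to noisy samples, then using the fitted model to predict
# values at new, unseen inputs.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)  # noisy "training data"

coeffs = np.polyfit(x, y, deg=3)                    # fit the model parameters
y_new = np.polyval(coeffs, np.array([0.25, 0.5]))   # predict on unseen inputs
print(coeffs, y_new)
```

Deep learning does the same thing at a much larger scale, with the connection weights of the layers playing the role of the polynomial coefficients.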

Deep Learning is what usually gets called AI today, but it is really just very elaborate pattern recognition and statistical fitting to data. The most common techniques / algorithms are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Reinforcement Learning (RL).

Convolutional Neural Networks (CNNs) have a hierarchical structure (which is usually 2D for images), where an image is sampled by (trained) convolution filters into a lower resolution map that represents the value of the convolution operation at each point. In images it goes from high-res pixels, to fine features (edges, circles, …), to coarse features (noses, eyes, lips, … on faces), then to the fully connected layers that can identify what is in the image. The cool part of CNNs is that the convolutional filters are randomly initialized, then when you train the network, you are actually training the convolution filters. For decades, computer vision researchers had hand-crafted filters like these, but could never get them as effective as CNNs can. Additionally, the output of a CNN can be a 2D map instead of a single value, giving us image segmentation. CNNs can also be used on many other types of 1D, 2D and even 3D data.
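As a rough PyTorch sketch of that pipeline (the layer sizes, channel counts, and 32x32 input are arbitrary choices for illustration, not from any specific system):

```python
# Illustrative CNN: trainable convolution filters feed pooled feature maps into
# a fully connected layer that classifies the image.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # fine features (edges, circles)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample to a coarser map
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # coarser, composite features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected head

    def forward(self, x):                 # x: (batch, 3, 32, 32)
        h = self.features(x)
        return self.classifier(h.flatten(1))

logits = TinyCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)                       # torch.Size([4, 10])
```

Training adjusts the weights of the convolution filters in features, which are exactly the “trained filters” described above.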

Recurrent Neural Networks (RNNs) work well for sequential or time series data. Basically each ‘neural’ node in an RNN is kind of a memory gate, often an LSTM or Long Short Term Memory cell. When these are linked up in layers of a neural net, these cells/nodes also have recurrent connections looping back into themselves and so tend to hold onto information that passes through them, retaining a ‘memory’ and allowing processing not only of current information, but past information in the network as well. As such, RNNs are good for time sequential operations like language processing or translation, as well as signal processing, Text To Speech, Speech To Text,…and so on.
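A minimal sketch of the same idea in PyTorch (dimensions are arbitrary): the LSTM carries state across the sequence, so each output depends on earlier inputs as well as the current one.

```python
# Illustrative LSTM: the recurrent cell carries state across the sequence, so the
# output at each time step depends on past inputs as well as the current one.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 4)                      # e.g. 4 output classes per step

x = torch.randn(2, 50, 8)                    # (batch, time steps, features)
outputs, (h_n, c_n) = lstm(x)                # outputs: (2, 50, 32)
step_logits = head(outputs)                  # a prediction at every time step
print(step_logits.shape)                     # torch.Size([2, 50, 4])
```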

Reinforcement Learning is a third main Machine Learning method, where you train a learning agent to solve a complex problem by simply taking the best actions given a state, with the probability of taking each action at each state defined by a policy. An example is running a maze, where the position of each cell is the ‘state’, the 4 possible directions to move are the actions, and the probability of moving each direction, at each cell (state) forms the policy.

By repeatedly running through the states and possible actions, rewarding the sequences of actions that gave a good result (by increasing the probabilities of those actions in the policy) and penalizing the actions that gave a negative result (by decreasing their probabilities), you in time arrive at an optimal policy, the one with the highest probability of a successful outcome. Usually while training, you discount the rewards/penalties for actions further back in time.

In our maze example, this means allowing an agent to go through the maze, choosing a direction to move from each cell by using the probabilities in the policy, and when it reaches a dead-end, penalizing the series of choices that got it there by reducing the probability of moving that direction from each cell again. If the agent finds the exit, we go back and reward the choices that got it there by increasing probabilities of moving that direction from each cell. In time the agent learns the fastest way through the maze to the exit, or the optimal policy. Variations of Reinforcement learning are at the core of the AlphaGo AI and the Atari Video Game playing AI.
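Here is a toy, tabular version of the maze example (my own simplification): each cell holds action preferences, the policy is a softmax over them, and whole trajectories are rewarded or penalized with a discount on earlier steps.

```python
# Toy tabular policy for the maze example: raise action preferences along
# trajectories that reach the goal, lower them along trajectories that fail,
# discounting actions further back in time.
import numpy as np

rng = np.random.default_rng(1)
SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
prefs = np.zeros((SIZE, SIZE, 4))                     # per-cell action preferences
goal = (SIZE - 1, SIZE - 1)

def policy(cell):
    p = np.exp(prefs[cell] - prefs[cell].max())       # stable softmax over preferences
    return p / p.sum()

for _ in range(2000):
    cell, path = (0, 0), []
    for _ in range(30):                               # cap the episode length
        a = rng.choice(4, p=policy(cell))
        path.append((cell, a))
        nxt = (cell[0] + ACTIONS[a][0], cell[1] + ACTIONS[a][1])
        if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE):
            reward = -1.0                             # hit a wall: penalize the trajectory
            break
        cell = nxt
        if cell == goal:
            reward = 1.0                              # found the exit: reward the trajectory
            break
    else:
        reward = -0.5                                 # wandered too long
    for t, (s, a) in enumerate(reversed(path)):
        prefs[s][a] += 0.1 * reward * (0.9 ** t)      # discount earlier actions

print(np.argmax(prefs[0, 0]))                         # preferred first move after training
```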

2) Limitations of Deep Learning

But all these methods just find a statistical fit to large amounts of data using simple models. DNNs find a narrow fit of outputs to inputs that does not usually extrapolate outside the training data set, and may learn incorrect features for identifying objects that do not work for novel inputs. Reinforcement learning finds a pattern that works for the specific problem (as we all did against 1980s Atari games), but not beyond it. The problem is that with today’s ML and deep learning there is no true perception, memory, prediction, cognition, or complex planning involved. There is no actual intelligence in today’s AI.

The reason that speech interfaces in devices are so limited and awkward to talk to is that existing DL is very narrow for speech-to-text and natural language processing and can only train to learn specific phrases and map them to specific intents, actions, or answers, giving only a skeleton of language comprehension with some clever scripting but not conversational speech capability.

As another use case, we do not have useful home robots today, ones that can freely navigate our homes, avoid obstacles, pets, and kids, and do useful things like cleaning, doing laundry, even cooking. This is because the narrow slices of deep learning available for vision, planning, and control of motors, arms, and manipulators cannot take in all this varied input or train on all the possible combinations of states, let alone plan and carry out useful tasks from them.

3) The Neuroscience of Human Intelligence

To go beyond where we are today with AI, to pass the threshold of human intelligence, and create an artificial general intelligence requires an AI to have the ability to see, hear, and experience its environment or input data streams. It needs to be able to learn that environment, to organize its memory and store abstracted concepts with distributed features so it can model that environment, and the objects, people, and events in it.

It probably needs to be able to speak conversationally and interact verbally like a human, and to understand the experiences, events, and concepts behind the words and sentences of language and how they connect, so it can compose language at a human level. It needs to be able to solve all the problems that a human can, using flexible memory recall, analogy, metaphor, imagination, intuition, logic, and deduction from sparse information. It needs to be capable of the tasks and jobs humans do, and to express the results in human language, in order to perform those tasks and professions as well as or better than a human.

The human brain underwent a very complicated evolution, starting roughly 1 billion years ago with the first multi-cellular animals with a couple of neurons, through the Cambrian explosion, where eyes, ears and other sensory systems, motor systems, and with them more neurons and intelligence exploded in an arms race (along with armor, teeth, and claws). Evolution of brains then followed the needs of fish, reptiles, dinosaurs, and mammals, and finally went up the hominid lineage starting about 5–10 million years ago.

Much of the older parts of the human brain were evolved for the previous 500 million years of violence and competition, not the last thousands of years of human civilization, so in many ways our brain is maladapted for our modern life in the information age, and not very efficient at many of the tasks we use it for in advanced professions like law, medicine, finance, and administration. A synthetic brain focused on doing these tasks optimally can probably end up doing them much better, so we do not seek to re-create the biological human brain, but to imbue ours with the core functionality that makes the human brain so flexible, adaptable and powerful, then augment that with CS database and computing capabilities to take it far beyond human.

Because deep learning DNNs are so limited in function and can only train to do narrow tasks with pre-formatted and labelled data, we need better neurons and neural networks with temporal spatial processing and dynamic learning. The human brain is a very sophisticated bio-chemical-electrical computer with around 100 billion neurons and 100 trillion connections (and synapses) between them. I will describe two decades of neuroscience in the next two paragraphs, but here are two good videos about the biological Neuron and Synapse from ‘2-Minute Neuroscience’ on YouTube that will also help.

Each neuron takes in spikes of electrical charge from its dendrites and performs a very complicated integration in time and space, resulting in the charge accumulating in the neuron and (once it exceeds the action potential threshold) causing the neuron to fire spikes of electricity out along its axon, moving in time and space as that axon branches and re-amplifies the signal, carrying it to thousands of synapses, where it is absorbed by each synapse. This process causes neurotransmitters to be emitted into the synaptic cleft, where they are chemically integrated (with ambient neurochemistry contributing). These neurotransmitters migrate across the cleft to the post-synaptic side, where their accumulation in various receptors eventually causes the post-synaptic side to fire a spike down along the dendrite to the next neuron. When two connected neurons fire sequentially within a certain time window, the synapse between them becomes more sensitive or potentiated, and then fires more easily. We call this Hebbian learning, which is constantly occurring as we move around and interact with our environment.
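As a rough illustration only (a generic leaky integrate-and-fire model, not the specific neuron model used later in this design), the integrate-then-fire behavior and the Hebbian strengthening of recently active synapses can be sketched in a few lines:

```python
# Sketch: a leaky integrate-and-fire neuron accumulates incoming charge, fires a
# spike when it crosses threshold, and a Hebbian rule strengthens the synapses
# whose presynaptic spikes coincided with the firing.
import numpy as np

rng = np.random.default_rng(2)
n_inputs, steps = 5, 200
w = np.full(n_inputs, 0.3)                        # synaptic weights
v, v_thresh, leak = 0.0, 1.0, 0.05

for t in range(steps):
    pre_spikes = rng.random(n_inputs) < 0.05      # random presynaptic spikes this step
    v = (1 - leak) * v + w @ pre_spikes           # leaky integration of incoming charge
    if v >= v_thresh:
        v = 0.0                                   # reset after the action potential
        w[pre_spikes] += 0.02                     # Hebbian: "fire together, wire together"
        w = np.clip(w, 0.0, 1.0)

print(w)
```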

Neurons are organized into micro-structures in the brain. The cortex of the brain consists of a ‘sheet’ of neurons about 2–3mm thick, about 7–9 layers of neurons deep. This sheet is wrapped and folded around the brain to form the cerebral cortex, the main structure for processing information, including sensory inputs, motor control, language understanding, speaking, cognition, planning, and logic.

This sheet is divided laterally into about 1 million cortical micro-columns of 100,000 neurons each. These micro-columns run from the bottom to the top layer of the cortex, and have more connections up and down than laterally, and most of their lateral connections do not extend beyond more than a few columns. So these little columns form discrete ‘computational’ units that are essentially remote from their neighbors, but interconnected with the whole brain through complex internal structures and connections.

How they function exactly is the subject of great debate in neuroscience, but their existence and what we know about them show that these units are similar, replicated across the cortex, and specialized to certain functions in certain areas. This is useful info for making an AGI, because if we can build and train an artificial cortical column and connect thousands of them into a sheet in the right way, we have a start at an artificial brain (maybe).

The brain’s cortices, consisting of these specialized cortical columns, evolved networks with very sophisticated space and time signal processing, including feedback loops and bidirectional networks: visual input is processed into abstractions or ‘thoughts’ by a network running in one direction, those thoughts are processed back out into a recreation of the expected visual representation by a complementary network running in the opposite direction, and the two are fed back into each other throughout. Miguel Nicolelis is one of the top neuroscientists to measure and study this bidirectionality of the sensory cortices.

For example, picture a ‘fire truck’ with your eyes closed and you will see the feedback network of your visual cortex at work, turning the ‘thought’ of a fire truck into an image of one. You could probably even draw it if you wanted. Try looking at clouds, and you will see shapes that your brain is feeding back to your vision as thoughts of what to look for and to see. Visualize shapes and objects in a dark room when you are sleepy, and you will be able to make them take form with your eyes open.

These feedback loops not only allow us to selectively focus our senses, but also train our sensory cortices to encode the information from our senses into compact ‘thoughts’ or engrams that are stored in the hippocampus as short-term memory. Each sensory cortex has the ability to decode them again and to provide a perceptual filter by comparing what we are seeing to what we expect to see, so our visual cortex can focus on what we are looking for and screen the rest out, as described in the previous paragraph.

The frontal and pre-frontal cortex are thought to have tighter, more specialized feedback loops that can store state (short-term memory), operate on it, and perform logic and planning at the macroscale. All our cortices (and brain) work together and can learn associatively and store long-term memories by Hebbian learning, with the hippocampus being a central controller for memory, planning, and prediction.

Human long-term memory is less well understood. We do know that it is non-local, as injuries to specific areas of the brain don’t remove specific memories, not even a hemispherectomy, which removes half the brain. Rather, any given memory appears to be distributed through the brain, stored like a hologram or fractal, spread out over a wide area with thin slices everywhere. We know that global damage to the brain, as in Alzheimer’s, causes a progressive global loss of memories, which all degrade together, but no single structure in the brain seems to contribute more to this long-term memory loss than another.

However, specific injury to the hippocampus causes the inability to transfer memory from short-term to long-term memory. Coincidentally, it also causes the inability to predict and plan, and other cognitive deficits, showing that all these processes are related. This area is the specialty of prominent memory neuroscientist Eleanor Maguire, who states that the purpose of memory in the brain is not to recall an accurate record of the past, but to reconstruct the past from the scenes and events we experienced and to predict the future, using the same stored information and the same process we use to look ahead and plan what to do. Therefore the underlying storage of human memories must be structured as an abstracted representation, in such a way that memories can be reconstructed from it for the purpose at hand, be it reconstructing the past, predicting the future, planning, or imagining stories and narratives: all hallmarks of human intelligence.

But how does our brain do more than just record all our experiences? How do we predict things we have never seen, plan for events we’ve never experienced, or say things we’ve never heard? If all our experiences are train tracks, how do we figure out what is in between? This is why humans dream: dreaming fills in the spaces between our experiences and builds models of our world from them.

Besides moving memories from short-term episodic memory (the hippocampus in humans) to long-term memory, the dreaming brain forms connections between potentially related memories, which we experience as REM dreams: somewhat fantastic, sometimes nonsensical sequences of events, but self-consistent and ordered. In sleep studies, researchers could identify with 90% accuracy which dream reports had been randomized by the researchers after recording them; there is an order and meaning to our REM dreams. Once the dreaming process has formed these connections, we have fictional ‘memories’ that help us model our world, which we can use to predict future events or plan contingencies. Take a look at When Brains Dream by Antonio Zadra and Robert Stickgold for amazing neuroscience research in this area.

Replicating all of the brain’s capabilities (image recognition, vision, speech, natural language understanding, written composition, solving mazes, playing games, planning, problem solving, creativity, imagination) seems daunting when approached with the tools of deep learning, because deep learning uses single-purpose components that cannot generalize. Each of the DNN/RNN tools is a one-off, a specialization for a specific task, and there is no way we can specialize and combine them all to accomplish all these tasks.

But the human brain is simpler and more elegant, using fewer, more powerful, general-purpose building blocks (the biological neuron) and connecting them using the instructions of a mere 8000 genes. Through a billion years of evolution, nature has come up with an elegant, easy-to-specify architecture for the brain and its neural network structures that is able to solve all the problems encountered during that evolution. We are going to start by copying as much of the human brain’s formation process and functionality as we can, then use evolution to solve the harder design problems.

So now we know more about the human brain, how the neurons and neural networks in it are completely different from the DNNs that deep learning uses, and how much more sophisticated our simulated neurons, neural networks, and cortices would have to be to even begin attempting to build something on par with, or superior to, the human brain.

At the end of this article is the 40-minute video talk on neuroscience and AGI that I submitted to the NVIDIA GTC 2021 conference, which builds on this material. Here is a 2-minute teaser:

https://youtu.be/YmqLYaiSOn0

4) How can we build an Artificial General Intelligence?

If you ask people who work on AGI how it can be done, you will probably get a different answer from each of them, because nobody has done it successfully yet and there are only theories. I’ve put 4 years into this design, and a couple of decades of research before that; I’m a practical engineer as well as a scientist, and the design has been scrutinized on Quora for the 3 years since I started writing about it. I will give you the quick version of the design before we dive into the details:

We will use Spiking Neural Network Autoencoders for sensory encoding and output decoding (they are bidirectional), and develop an AGI core that is capable of taking in this encoded data, sifting it into a basis set of features, training a model on how those features change in time during an experienced narrative of events and how they relate to each other, and then using that model to simulate or dream fictional narratives to fill in the blanks in the model. This takes our AGI core well beyond deep learning, by allowing it to build a full, feature-rich, and accurate model of its world, and to augment that model by dreaming, like humans do. This is the only way to have something like a speech AI learn sentences it has never heard, or a robot do tasks it has never experienced: structured dreaming to fill in the blanks, reinforced (or degraded) by later experiences.

Now the details of the nuts and bolts:

1) AGI Methods. We propose methods and processes for running computer simulations of Artificial General Intelligence (AGI) that can operate on general inputs and outputs, which do not have to be specially formatted or labelled by humans and can consist of any alphanumeric data stream, 1D, 2D, and 3D temporal-spatial inputs, and others. The AGI is capable of doing general operations on them that emulate human intelligence, such as interpolation, extrapolation, prediction, planning, estimation, guessing, and intuition to solve problems with sparse data. These methods do not require specific coding, but can rather be learned unsupervised from the data by the AGI and its internal components using spiking neural networks. Using these methods, the AGI would reduce the external data to an internal format that computers can understand, be able to do math, linear algebra, and supercomputing, and use databases, yet still plan, predict, estimate, and dream like a human, then convert the results back to a human-understandable form.

a) The input system would learn to autoencode any time domain input, including alphanumeric streams, 1D, 2D, and 3D inputs using the SNN autoencoders in (6), to encode them into compact engram streams, and write these engram streams to short-term memory.

b) After a predetermined duration of short-term memory has been recorded, it is batch processed by cutting it into segments: convolving it with time-domain functions like a Gaussian or unit step function centered at time t, advancing t by dt each time, such that the segments have a predetermined overlap.

c) Processing the engram segments using a hierarchical sorting architecture of autoencoders and PCA operations, then convolving engram segments with the vector basis sets at the leaf nodes to transform them into a set of basis coefficients.

d) Storing basis coefficients encoded from inputs (or those computed internally) into temporal narratives in memory.

e) Doing mathematical, neural net, and other computations between sets of basis coefficients.

f) Using neural net constructs such as predictors, solvers, and dreamers to do operations on narratives of basis coordinates.

g) Transforming the internal basis coefficient narrative representations back to engrams using the autoencoder hierarchy, then back to real-world outputs using the autoencoders.

h) Creating outputs to drive time domain outputs for actuators, language, and other outputs using a ROS / inhibitor network scheme.

With these developments, we will take the first steps towards AGI that can perceive the real world, reduce those perceptions to an internal format that computers can understand, yet still plan, think and dream like a human, then convert the results back to human understandable form, and even converse fluently using human language, enabling online interfaces and services that can interact much more like a person.

See NeuroCAD Utility Patent US #16/437,838 + PCT, filed June 11, 2019, for details of (2)–(8), summarized here for reference:

2) Methods for designing and evolving spiking neural networks (SNN), which consist of computer simulated (assumed from here forward) artificial neurons connected by axons and dendrites with a synapse between each axon and dendrite. Spikes of current are transmitted from the neuron, out along the axon, then are absorbed at the synapse and processed. The synapse may then, depending on the computation, transmit a spike out along the dendrite and to the neuron. Each time the neurons on either side of a synapse fire in sequence, that synapse is ‘strengthened’ and the likelihood of transmitting a spike increases. The spikes move in time and space and this temporal circuitry is key to the spiking neural network’s functionality and utility. Axons and dendrites can branch, with the outgoing spikes splitting and being amplified as they go out along the branches of the axon. Likewise, signals can combine as dendrites merge before entering the neuron. The incoming signals to a neuron can be excitatory or inhibitory, adding or subtracting charge from the neurons. Neurons can then integrate the incoming signals or differentiate them, or other operations, then emit spikes based on their internal computation.

3) To model these in a computer, we use mathematical models for the neurons and synapses that integrate, differentiate, or otherwise compute the contribution of the incoming charges and compute an output based on the mathematical model over time. We move the discrete spikes of current along the axons and dendrites at a constant speed, checking for when they enter a synapse or neuron. Sensory inputs are translated to spikes of current into sensory neurons.

4) Each neural network connectome, consisting of neurons connected by axons, synapses, and dendrites into a neural network, is represented by a compact genome, a small chunk of numeric and alphanumeric data that compactly represents the topology of the network, the number and size of layers, the types of neurons in them, and the statistical distribution of the connections from neurons in one layer or topological region to another. These genomes (G) always expand deterministically to the same connectome (C), and they interpolate smoothly, such that a genome G interpolated to be between G0 (which expands to C0) and G1 (which expands to C1) will, when expanded, result in a neural net connectome C that is between C0 and C1 in properties. This ‘smoothness’ property is necessary for the genetic algorithms to converge.
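A toy sketch of the genome-to-connectome idea, with an invented two-parameter genome: the genome expands deterministically (fixed seed) into a connectivity matrix, and an interpolated genome expands to a network whose properties lie between those of its parents.

```python
# Toy genome -> connectome expansion: a tiny parameter vector deterministically
# generates a connectivity matrix, and interpolating genomes interpolates the
# resulting network's properties (the "smoothness" needed for genetic algorithms).
import numpy as np

def expand(genome, n_neurons=100):
    conn_prob, weight_scale = genome
    rng = np.random.default_rng(42)                        # same seed: deterministic expansion
    mask = rng.random((n_neurons, n_neurons)) < conn_prob  # which connections exist
    return weight_scale * mask                             # weighted connectome matrix

g0, g1 = np.array([0.05, 0.5]), np.array([0.30, 1.5])
g_mid = 0.5 * (g0 + g1)                                    # genome interpolated between g0 and g1
for g in (g0, g_mid, g1):
    density = expand(g).astype(bool).mean()
    print(g, density)                                      # mid genome -> intermediate density
```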

5) We use evolutionary methods for designing and evolving spiking neural networks (SNN), using genetic algorithms operating on the compact genomes — that are expanded to the full neural networks to be trained, then evaluated according to specified criteria, to see if they will be crossbred for the next generation, repeating till a SNN that meets the specified criteria is evolved.
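Continuing that sketch (and reusing its expand() function), a toy genetic-algorithm loop in the spirit of (5) might look like the following, with a made-up scoring criterion standing in for real training and evaluation:

```python
# Toy genetic algorithm over genomes: expand, score, keep the best, cross-breed
# by blending parent genomes, mutate, and repeat for several generations.
import numpy as np

rng = np.random.default_rng(6)
TARGET_DENSITY = 0.12                                   # stand-in evaluation criterion

def fitness(genome):
    return -abs(expand(genome).astype(bool).mean() - TARGET_DENSITY)

population = [np.array([rng.uniform(0.01, 0.5), rng.uniform(0.1, 2.0)]) for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                           # keep the best half
    children = []
    for _ in range(10):
        i, j = rng.choice(10, size=2, replace=False)
        child = 0.5 * (parents[i] + parents[j])         # crossover by blending genomes
        child += rng.normal(0.0, 0.01, size=2)          # mutation
        children.append(np.clip(child, 0.01, 2.0))
    population = parents + children

print(population[0], fitness(population[0]))            # best genome found
```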

6) We use these methods to evolve SNNs specialized for different functions, including bidirectional interleaved autoencoders that consist of layers of neurons alternating between ‘even’ layers containing mostly forward connections (skipping one layer to the next even layer) and ‘odd’ layers containing mostly reverse connections (skipping a layer to the next odd layer), with some crossover in the connectivity, and with final connectivity determined by genetic algorithms. Input comes in at layer 0, is encoded through the encoder into the bottom layer(s), where it is forced through a constrained bottleneck, and then decoded back through the autoencoder to layer 1, which is fed back into the even layers to generate a training feedback loop. The exact connections between layers and the feedback are determined by evolution via the genetic algorithms to find the configuration with optimal performance (with the selection criteria including encode/decode quality, latency, and encoded size), with the encoding and decoding method and encoded format learned at runtime during training and evolution.

8) We put the data through our SNN autoencoder, forcing it through a constriction or narrowing in the mid-section during the autoencoding process, enlarge it again, and train the autoencoder to reproduce the original data. By doing so, the data is compressed at the constriction, but in a way that the entire autoencoder circuitry stores all the common features of the entire data set it has encoded to date (a set of basis vectors), while the output at the constriction is the set of basis coordinates referencing the basis vectors internal to the autoencoder. We take the output from the area or volume of constriction for each input and record it into memory in time as an ‘engram stream’, an encoded memory stream that is analogous to human short-term memory in the hippocampus.
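As a stand-in illustration (a conventional dense autoencoder in PyTorch rather than a spiking one, with made-up sizes), the constriction idea looks like this; the bottleneck activations play the role of the engram:

```python
# Autoencoder with a constriction: train it to reproduce its input, and the
# bottleneck activations become a compressed code (the "engram") for each input.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))  # constriction: 16 values
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                      # stand-in batch of flattened inputs
for _ in range(100):
    engram = encoder(x)                      # compressed code at the bottleneck
    recon = decoder(engram)
    loss = nn.functional.mse_loss(recon, x)  # train to reproduce the original data
    opt.zero_grad(); loss.backward(); opt.step()

print(engram.shape, loss.item())             # torch.Size([64, 16])
```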

9) We autoencode these input streams into engram streams in a process where there can be one or more such inputs generating multiple engram streams. Inputs may be encoded into one engram stream each, both into the same engram stream with interleaving, convolution, addition or other operations, or a hybrid where they are encoded into their own engram streams and both encoded into a hybrid engram stream with interleaving, convolution, addition or other operations.

10) Our encoded engrams will be in the form of these low-dimensional constricted volumes with the time dimension they were recorded in — termed an engram stream, representing a compressed record of the inputs in time, that can be reconstituted or decoded back into the original input. We now have a starting point: we can reduce the sensory inputs for our AGI to this compressed format, but the engram streams are still unwieldy, and we cannot do useful operations on them, except for volumetric convolutions to test them against other engrams. We need a better basis set.

11) By convolving the engram stream with a function (such as a step function or Gaussian) at multiple points in the time domain, we can cut it into intervals or segments, at different time intervals (t0 + dt * j) with j spanning the engram stream to be segmented. By choosing the convolution function parameters, and the value of dt, this will determine the overlap in time of the volumes. By doing this, we reduce the engram stream to a set of unit volumes in four dimensions, a little ‘swirl’ of reality in each volume.
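A minimal sketch of that segmentation step (window width and hop size invented), implemented as multiplying the stream by a Gaussian window centered at each t0 + j·dt, which is one concrete reading of the convolution described:

```python
# Cut a continuous engram stream into overlapping segments by applying a Gaussian
# window centered at successive times t0 + j*dt.
import numpy as np

stream = np.random.default_rng(3).standard_normal(1000)    # 1-D stand-in engram stream
t = np.arange(stream.size)
sigma, dt = 25.0, 50                                        # window width and hop size

segments = []
for center in range(0, stream.size, dt):
    window = np.exp(-0.5 * ((t - center) / sigma) ** 2)     # Gaussian centered at t0 + j*dt
    segments.append(stream * window)                        # one overlapping segment

print(len(segments), segments[0].shape)
```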

12) These volumes are a more compact and useful format than either the raw inputs or the longer engram streams. We can create an orthogonal basis set of such volumes — essentially a set of orthogonal basis vectors that spans the space of previously experienced engram segments, then any engram segment can be decomposed into a linear combination of the vectors of this basis set by convolution with each vector of the basis set in a reversible process. Then when the basis vectors are each multiplied by the corresponding basis coordinates and they are linearly combined, the original engram segment is reconstituted.

We will term this set of basis coordinates in time a ‘narrative’ in ‘long-term’ memory, essentially a stream of numbers that are much more useful for doing calculations on.
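A small numerical sketch of that reversible decomposition, using a random orthonormal basis as a stand-in for the learned one:

```python
# Project an engram segment onto an orthonormal basis to get its basis
# coordinates, then rebuild it as a linear combination of the basis vectors.
import numpy as np

rng = np.random.default_rng(4)
basis, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # 64 orthonormal basis vectors (columns)
segment = rng.standard_normal(64)                        # a flattened engram segment

coords = basis.T @ segment                               # basis coordinates ("narrative" entry)
reconstructed = basis @ coords                           # linear combination of basis vectors
print(np.allclose(segment, reconstructed))               # True: the process is reversible
```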

13) We still have not determined how we will compute our orthogonal basis set. Usual methods like Gram-Schmidt would be too costly, because for our system, we need a basis set that can potentially span all of visible or audio reality. The size of the basis set, and the mechanism for computing it, would be immense and computationally prohibitive with these methods. We need to be able to work in parallel.

14) Fortunately, we already have an analogy for such a system in the human brain. The cerebral cortex is a sheet about 4mm thick wrapped and folded around the outside of the brain, consisting of cortical micro-columns, each containing about 100,000 neurons, 7 neuron layers deep.

15) These cortical columns look a lot like our autoencoders, which take in inputs like vision, audio, …, and encode them to a compact engram, storing the common information about all the inputs they have seen in the autoencoder circuitry and the unique information about each input in the engram. Done in a hierarchy of autoencoders, in a cascade, this would gradually separate the similarity out of the engrams as they pass down the hierarchy, and leave us with orthogonal basis engrams at the leaf nodes.

16) More formally: a method for dynamically creating an orthogonal basis set of engram vectors (12), by submitting a batch of engrams and spreading them along an axis, sorted by a specific feature, forming clusters of engrams along the axis. These clusters are then each encoded by a specific autoencoder, removing their common feature, and then the resulting engrams are spread out on new axes sorted by new features. This is done recursively till one engram remains in each cluster, giving us a set of leaf nodes that constitute an orthogonal basis set of vectors. New engram batches can be added later to create new clusters, autoencoders, axes, and basis vectors, making it dynamic and able to learn.
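A very rough sketch of that recursive sort-and-split idea, simplified to plain SVD/PCA and a two-way split in place of the autoencoder clusters:

```python
# Recursive sort-and-split sketch: spread engrams along their leading principal
# axis, split into two clusters, strip the shared component, and recurse until
# each leaf holds a single engram (a candidate basis vector).
import numpy as np

def build_leaves(engrams, leaves):
    if len(engrams) == 1:
        leaves.append(engrams[0])
        return
    centered = engrams - engrams.mean(axis=0)             # remove the cluster's common feature
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]                              # spread along the leading PCA axis
    left, right = centered[scores < 0], centered[scores >= 0]
    for cluster in (left, right):
        if len(cluster):
            build_leaves(cluster, leaves)

rng = np.random.default_rng(5)
leaves = []
build_leaves(rng.standard_normal((16, 32)), leaves)        # 16 engrams of dimension 32
print(len(leaves))                                         # one leaf basis vector per engram
```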

17) There is a biological analogy for this hierarchy, the thalamocortical radiations, a neural structure that branches out like a bush from the thalamus (the main input / output hub of the brain for the senses, vision, audio and motor outputs) with the finest branches terminating at the cerebral cortex, feeding input from the senses to the cortical columns. The cortical columns of the cerebral cortex are analogous to our terminal layer of autoencoders, whose purpose is to store the orthogonal basis vectors for reality and do computations against them, including computing basis coordinates from input engrams. Each section of the cortex is specialized for a specific type of input (visual, auditory, olfactory,… ) or output (motor, speech), and our design will have a separate hierarchy and autoencoder basis set for each mode of input, to generate basis coordinates for that input / output mode.

18) To determine the basis coefficients for an input engram segment, the engram segment is passed through the process specified in (16), but singularly, splitting it off into engram segments that traverse the correct portion of the hierarchy, until convolution with the basis vectors at the leaf nodes determines the basis coefficients for that engram. This process can be used in reverse, multiplying the basis coefficients with the basis vectors at the leaves and passing them back up through the hierarchy of autoencoders to reconstruct the original engram. Basically we have a system that can deconstruct reality to numerical basis coordinate vectors and back. That will come in handy later for computations.

19) A method for organizing the narratives of basis coefficients (12) into segments, which can then be arranged hierarchically and/or connected to other segments and/or hierarchies to form composite structures in memory. Narrative segments and hierarchy structures can be collapsed and instanced, with multiple points in multiple hierarchies referring to the same instance of the segment (or hierarchy structure). These repeated child segments or hierarchies need only be stored once in memory, relationships with them and other data need only be computed once, reducing space and computational requirements.

20) A method for sampling basis coordinates from a plurality of encoded narratives (N), from a plurality of intervals in time (t-j,… t-2, t-1, t0, t1, t2,..tk)N, with these coordinates as input to a spiking neural network (SNN), as defined in (5) which computes the values of one or more sets of basis coordinates for one or more points in time (t-j,… t-2, t-1, t0, t1, t2,..tk) in a computed narrative as output. Before use, this SNN is trained on known inputs and outputs, and also evolved to perform its operations optimally. This allows us to do arbitrary computing using temporal narratives of basis sets (and the reality they represent) as our inputs and outputs.

21) A method for sampling basis coordinates from a plurality of intervals in time (t-j,… t-2, t-1, t0, t1, t2,..tk)N from a plurality of encoded narratives (N), with these coordinates as input to a spiking neural network, as defined in (5) which computes the values of one or more sets of basis coordinates for one or more points in time (t0, t1, t2,..tk) in a future narrative as output. We will term this method a “predictor model”. Before use, this SNN predictor is trained and evolved as per (5) on known inputs and outputs, and also evolved optimally to predict.

(Concept diagram: a predictor model sampling past narratives, with the SNN brain computing the future narrative.)
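As a stand-in for the evolved SNN predictor, here is a sketch that trains a conventional network to map a window of past basis-coordinate frames in a narrative onto the next frame (all dimensions and data are invented):

```python
# Stand-in predictor: learn to map a window of past basis-coordinate frames in a
# narrative to the next frame (a conventional network instead of an evolved SNN).
import torch
import torch.nn as nn

dim, window = 16, 8                                   # coordinates per frame, context frames
predictor = nn.Sequential(nn.Flatten(), nn.Linear(window * dim, 64),
                          nn.ReLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

narrative = torch.randn(500, dim)                     # a recorded narrative of basis coordinates
for _ in range(200):
    idx = torch.randint(0, 500 - window - 1, (32,))   # sample 32 random context windows
    past = torch.stack([narrative[int(j):int(j) + window] for j in idx])
    target = narrative[idx + window]                  # the frame that actually followed
    loss = nn.functional.mse_loss(predictor(past), target)
    opt.zero_grad(); loss.backward(); opt.step()

print(loss.item())
```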

22) A method for simulating memory narratives or ‘dreaming’ by allowing the predictor (21) to start its prediction on existing memories, then move forward in time, detaching from the narrative to compute its future predictions using input from its just-generated predicted memories, creating a fictional or dream narrative (shaped by its model of reality) in the memory narrative behind it in time. Optionally, it can ‘attach’ to an existing narrative and detach to dream multiple times. This is repeated to create dream narratives that form a web connecting experienced narratives, to augment them.
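Continuing that predictor sketch, ‘dreaming’ then amounts to seeding the predictor with real frames from a narrative and letting it roll forward on its own outputs:

```python
# Dreaming as autoregressive rollout: start attached to a real memory, then keep
# feeding the predictor its own predictions to lay down a fictional narrative.
import torch

seed = narrative[:window].clone()                 # attach to an existing memory
dream = [frame for frame in seed]
for _ in range(50):                               # detach and roll forward in time
    context = torch.stack(dream[-window:]).unsqueeze(0)
    next_frame = predictor(context).squeeze(0).detach()
    dream.append(next_frame)                      # the prediction becomes new input

dream_narrative = torch.stack(dream)              # fictional narrative to add to memory
print(dream_narrative.shape)
```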

23) A method of continuously evaluating the dreamed memories as they are laid down and as they are traversed later, to decide if they should be attenuated or amplified, depending on their conformance to real memories and to the predictor model, or by reconstructing them into their corresponding engrams, and/or output data format to be evaluated.

24) A method of organizing language narratives structurally as in (19) for human language, such that each character for written language forms a basis vector, instanced in the language narrative by a basis coordinate. Spaces delineate segments consisting of words, punctuation delineates a segment (a hierarchy spanning words) of sentences, and CR characters delineate a segment (hierarchy of sentences) defining paragraphs. Similar organization is done for spoken language by having phonemes as the basis vectors, referred to by a basis coordinate, pauses delineating words, and longer pauses delineating sentences and paragraphs. Symbolic languages will be similarly structured in narratives and organized into hierarchy according to their written and spoken structure.
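A trivial sketch of that structural organization for written language, using the delimiters described (spaces for words, sentence punctuation, newlines for paragraphs):

```python
# Organize raw text into the paragraph -> sentence -> word hierarchy described,
# using spaces, sentence-ending punctuation, and newlines as the delimiters.
text = "The truck is red. It is loud!\nWe watched it pass."

paragraphs = [
    [
        sentence.split()                                    # words within a sentence
        for sentence in paragraph.replace("!", ".").replace("?", ".").split(".")
        if sentence.strip()
    ]
    for paragraph in text.split("\n")                       # newline-delimited paragraphs
]
print(paragraphs)
```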

25) A method of connecting narratives (24) derived from different input types, such as visual and language, at coordinates where they are temporally, spatially or conceptually related, such that processing of one type of narrative can reference the related information in the other type of narrative as input or output in the processing. The method would also be able to connect between the higher levels of the hierarchies (9) to allow more abstract operations between the different levels. This makes language the backbone of our AGI’s memory and cognition, by connecting words to references to visual objects and sounds, sentences to abstractions of sequences of visual and audio events, and paragraphs to abstractions of scenarios and stories in memory, with each word, sentence, and paragraph connected to one or more memories.

26) A method for the AI to converse naturally with a human, by training a set of dreaming predictors (22), evolved (5) to learn human language by training and evolution (5) on a plurality of human conversations, where they learn proper responses, grammar, and composition, each training on one person’s side of the conversation, on the hierarchy of words and sentences (24) stated by that person in an alternating conversation. Then when actually conversing, the AI uses each predictor to predict what the other person will say next, and what it should say now, pulling words and phrases from previous segments of the conversation to incorporate where appropriate, and dreaming where it needs to ad-lib the conversation. Each predictor would also have connections to the information about other modalities (25) and their hierarchies, including visuals, audio, date, time, and location, to give the words context and to interface with peripherals.

27) A method to compute a path through memory narratives (consisting of basis coordinates) and segments that can best solve a problem, by having predictors begin at a start point (consisting of starting basis coordinates), and/or at the goal (consisting of the desired ending basis coordinates), and then follow memory narratives away from their origin, branching at junctions one or more times until a path from the start connects with one from their goal, or they connect with their respective goal directly. The predictors can optionally detach from the narratives and dream as in (12) and reattach to another narrative. This process (27) can be iterated multiple times using the prior solution narratives and computing deviations to find better solutions.

28) A method for selectively screening input by performing operations on it with computed output data, such as a convolution, subtraction, addition, or other operation. The data to be used as a screen can be generated from a memory, computed by a predictor, allowing the AI to look or listen for what it anticipates, or as a spatial filter to specify what area of a 2D input to look at, such as scanning with a convolution kernel smaller than the input, or samples at grid points in a 2D input with different convolution extents to generate a multi-sample, multi-resolution representation of the 2D input.

29) A method for training the AI to produce desired outputs for external systems (actuators for drones, robots, or peripherals, displays, and other devices), and control them. Based on input basis coordinates, and operations between them and narratives in memory, and other means — output basis coordinates and narratives are computed. Then these synthesized basis coordinates are decoded to a synthesized engram and then to synthesized outputs (at the 2nd layer of the autoencoders) which are fed to the external systems. The autoencoders are bidirectional, so for training, inputs are fed in the top layer of the output autoencoders and are encoded to engrams and basis coordinates (representing output narratives for actuators, peripherals, and displays), and these are used to train the internal AI model and components on what outputs they should be computing.

30) A method for computing temporal outputs for motor controls, language, and other outputs based on ROS neurons. It originates with a signal that sets a tempo or pattern with time (t), that is the same regardless of the output to generate. This signal is then input at the root of each of a plurality of hierarchies (19, 24) of branching structures (with the outputs at the leaf nodes) that can be selectively inhibited at each branch and level (by modulating the inhibitory signal at each branch with the signal that was passed down into the hierarchy at the branch) to select which terminal node(s) of the hierarchy are activated when the tempo signal is >0. Each branching hierarchy forms a spatial-temporal basis set that can be controlled by the inhibitory signals, and the outputs from each blended via these inhibitory signals to form novel output units that are sequenced temporally. This is trained by back-driving the desired outputs to train the inhibitory signals to generate that output.

31) Creating a branching pyramidal neural structure that branches from the ROS temporal input origin up to the ‘cortical column’ autoencoders (15–18), such that each outermost branch terminates at one autoencoder, with the signal strength being the basis coefficient fed into that autoencoder. Now, when the ROS temporal input fires, the signal travels through the branches (modulated by the inhibitory signals to each branch), delivering basis coefficients that modulate the basis vectors in the cortical column / autoencoder layer, which are then propagated up through the HFM hierarchy and decoded to a series of engram segments corresponding to the output of the ROS / inhibitor network. This engram stream is decoded to the correct output by the autoencoders, be it audio, speech, visuals, or actuator controls.

32) A method of training a human ‘mimic’ AI using the methods of (29, 30) by using data from a performance capture of a real person acting out a plurality of scenarios to supply the inputs and outputs to the AI for the language, speech, vision, body movement, and facial movement of an artificial AI person, where that AI person can be without physical form, instantiated with 3D computer graphics as a character, or instantiated as a realistic humanoid robot, with the motion and facial expressions mapped to the actuators for the latter, with the goal being to provide an AI with realistic, human-like dialog, lip-sync, facial expressions and movement. The detailed performance capture data can be augmented with simpler text conversations that may be general or specific purpose to a vocation, and other general data to fill in the blanks in training, and also augmented with dreaming (22) between training sessions.

33) A method for defining the entire AGI system architecture described in these claims by a master genome, consisting of all parameters determining the functionality of the system as a whole, the properties, configurations, and number of instances of all the components (using methods defined in 1–20), and all parameters determining their individual functionality. This genome is compact and can be expanded to the full AGI system deterministically, in a manner that is smooth, such that a genome (G) interpolated between two genomes (G1, G2) expands to an AGI (A) that is also between the two AGIs represented by (A1, A2) in form and function.

34) A method of evolving the AGI as in (5), with a plurality of AGI genomes expanded to a plurality of AGI architectures that are then trained and evaluated in parallel, with the genomes of the most successful AGIs cross-bred, mutated, and used to define the next set of AGI architectures to test. This could include an iterative process to evolve the genomes of the internal components one or more times in a sub-process between iterations of evolving the whole AGI system. This allows the AGI to be evolved, refined, and expanded exponentially, limited only by the computing power and memory available and the amount and quality of input data to train and test it on.

Resulting in: Artificial General Intelligence, ETA 2030

https://youtu.be/qU44BUagddY

https://www.orbai.ai/artificial-general-intelligence.htm

Brent Oster

brent.oster@orbai.com
