
Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence


DOCUMENT INFORMATION

Basic information

Title: Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence
Author: Sandro Skansi
Institution: University of Zagreb
Field: Computer Science
Series: Undergraduate Topics in Computer Science
Year of publication: 2018
City: Zagreb
Pages: 196
Size: 2.13 MB

Structure

  • 1.1 The Beginnings of Artificial Neural Networks
  • 1.2 The XOR Problem
  • 1.3 From Cognitive Science to Deep Learning
  • 1.4 Neural Networks in the General AI Landscape
  • 1.5 Philosophical and Cognitive Aspects
  • 2.1 Derivations and Function Minimization
  • 2.2 Vectors, Matrices and Linear Programming
  • 2.3 Probability Distributions
  • 2.4 Logic and Turing Machines
  • 2.5 Writing Python Code
  • 2.6 A Brief Overview of Python Programming
  • 3.1 Elementary Classification Problem
  • 3.2 Evaluating Classification Results
  • 3.3 A Simple Classifier: Naive Bayes
  • 3.4 A Simple Neural Network: Logistic Regression
  • 3.5 Introducing the MNIST Dataset
  • 3.6 Learning Without Labels: K-Means
  • 3.7 Learning Different Representations: PCA
  • 3.8 Learning Language: The Bag of Words Representation
  • 4.1 Basic Concepts and Terminology for Neural Networks
  • 4.2 Representing Network Components with Vectors
  • 4.3 The Perceptron Rule
  • 4.4 The Delta Rule
  • 4.5 From the Logistic Neuron to Backpropagation
  • 4.6 Backpropagation
  • 4.7 A Complete Feedforward Neural Network
  • 5.1 The Idea of Regularization
  • 5.2 L1 and L2 Regularization
  • 5.3 Learning Rate, Momentum and Dropout
  • 5.4 Stochastic Gradient Descent and Online Learning
  • 5.5 Problems for Multiple Hidden Layers: Vanishing and Exploding Gradients
  • 6.1 A Third Visit to Logistic Regression
  • 6.2 Feature Maps and Pooling
  • 6.3 A Complete Convolutional Network
  • 6.4 Using a Convolutional Network to Classify Text
  • 7.1 Sequences of Unequal Length
  • 7.2 The Three Settings of Learning with Recurrent Neural Networks
  • 7.3 Adding Feedback Loops and Unfolding a Neural Network
  • 7.4 Elman Networks
  • 7.5 Long Short-Term Memory
  • 7.6 Using a Recurrent Neural Network for Predicting Following Words
  • 8.1 Learning Representations
  • 8.2 Different Autoencoder Architectures
  • 8.3 Stacking Autoencoders
  • 8.4 Recreating the Cat Paper
  • 9.1 Word Embeddings and Word Analogies
  • 9.2 CBOW and Word2vec
  • 9.3 Word2vec in Code
  • 10.1 Energy-Based Models
  • 10.2 Memory-Based Models
  • 10.3 The Kernel of General Connectionist Intelligence
  • 11.1 An Incomplete Overview of Open Research Questions
  • 11.2 The Spirit of Connectionism and Philosophical Ties

Content

The XOR Problem

In the 1950s, the Dartmouth conference marked a pivotal moment for artificial intelligence, highlighting the emerging interest in neural networks. Marvin Minsky, a key figure in AI and a participant at the conference, had finished his dissertation at Princeton in 1954, titled "Neural Nets and the Brain Model Problem." This groundbreaking thesis not only tackled various technical challenges but also compiled the latest findings and theorems on neural networks. Earlier, in 1951, Minsky had constructed a machine, supported by external funding, which further contributed to the field's development.

Russell and Carnap significantly influenced Pitts, yet many contemporary logicians remain unaware of his contributions. This chapter aims to reintroduce Pitts to the academic community and ensure he receives the recognition he deserves for his remarkable work.

7 And any other scientific discipline which might be interested in studying or using deep neural networks.

The webpage on Walter Pitts at http://www.abstractnew.com/2015/01/walter-pitts-tribute-to-unknown-genius.html is a valuable resource worth exploring. The machine Minsky built in 1951, funded by the Air Force Office of Scientific Research, was SNARC (Stochastic Neural Analog Reinforcement Calculator), the first significant computer implementation of a neural network. Notably, Marvin Minsky, who served as an advisor for the film 2001: A Space Odyssey, was named by Isaac Asimov as one of only two individuals whose intelligence surpassed his own, the other being Carl Sagan. Minsky's contributions to the field of deep learning are significant and will be explored further.

Frank Rosenblatt earned his PhD in Psychology from Cornell University in 1956 and significantly advanced the field of neural networks with the invention of the perceptron learning rule, which dictates how neural network weights are updated. Initially, his perceptrons were programmed on an IBM 704 computer at the Cornell Aeronautical Laboratory in 1957, leading to further developments in the Mark series.

The Perceptron, a computer designed to implement neural networks with the perceptron rule, was developed by Frank Rosenblatt. In his 1962 book, Principles of Neurodynamics, he delved into various architectures and introduced the concept of multilayered networks, akin to today's convolutional networks, which he termed C-systems. This work can be regarded as a theoretical foundation for deep learning. Tragically, Rosenblatt died in a boating accident on his 43rd birthday in 1971.

In the 1960s, two significant trends emerged in research. The first focused on symbolic reasoning through deductive logical systems, exemplified by programs like Herbert Simon, Cliff Shaw, and Allen Newell's Logic Theorist and General Problem Solver. These programs achieved noteworthy results, unlike neural networks, which struggled to demonstrate comparable intelligence despite their ability to perform tasks such as image classification. Symbolic systems were favored for their control and extensibility, while neural networks were perceived as less intelligent because they could not replicate complex human-like reasoning, such as theorem proving and chess playing. Hans Moravec's work in the 1980s highlighted the paradox: symbolic thinking is rare and valued in humans but comes naturally to computers, which in turn struggle with basic intelligent behaviors that humans easily accomplish, like recognizing animals in photos or manipulating objects.

The second trend was the Cold War. Starting in 1954, the US military wanted a program to automatically translate Russian documents and academic papers.

Many people still view activities like playing chess or proving theorems as superior forms of intelligence compared to gossiping, because of their rarity. However, the rarity of a type of intelligence does not necessarily relate to its computational characteristics: problems that are simple to describe are generally easier to solve, regardless of how cognitively rare they are in humans or machines.

Despite the abundance of funding, many technically skilled researchers underestimated the linguistic challenges in extracting meaning from language. A notable example is the mistranslation of the phrase "the spirit was willing but the flesh was weak," which came back as "the vodka was good, but the meat was rotten" after being translated to Russian and back to English. In 1964, concerns arose about the effective use of government funds, leading the National Research Council to form the Automatic Language Processing Advisory Committee (ALPAC). The 1966 ALPAC report resulted in a significant reduction of funding for machine translation projects, causing stagnation in the field and turmoil in the broader AI community.

But the final stroke that nearly killed off neural networks came in 1969, from Marvin Minsky and Seymour Papert [15], in their monumental book Perceptrons: An Introduction to Computational Geometry. McCulloch and Pitts had shown that neural networks can compute various logical functions, yet Minsky and Papert highlighted a notable oversight: the equivalence function. The computer science and AI community usually frames the issue in terms of the XOR function, which is the negation of equivalence; the distinction lies solely in the labeling of these functions.

Perceptrons are fundamentally linear classifiers, which limits their ability to capture nonlinear relationships such as XOR. The perceptron learning procedure is notable for its guaranteed convergence, yet it cannot accommodate nonlinear separation. Plotted in a 2D coordinate system, the XOR outputs cannot be separated by a single straight line, which exposes the limitation of perceptrons. Although the concept of multilayered perceptrons existed, the known learning rules could not train them. Consequently, neural networks could not perform even basic logical operations, a task easily managed by symbolic systems. This period marked a decline in neural network research, made worse in the USSR, where cybernetics had been dismissed as a bourgeois pseudoscience.

10 The view is further dimmed by the fact that the perceptron could process an image (at least a rudimentary one), which intuitively seems considerably harder than simple logical operations.

11 Pick up a pen and paper and draw along.

12 If you wish to try equivalence instead of XOR, do the same but with EQUIV(0, 0) = 1, EQUIV(0, 1) = 0, EQUIV(1, 0) = 0, EQUIV(1, 1) = 1, keeping the Os for 0 and the Xs for 1. You will see it is literally the same thing as XOR in the context of our problem.
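The pen-and-paper exercise in the footnotes can also be brute-forced in code. The sketch below is illustrative (not from the book): it searches a small grid of weights w1, w2 and a bias b for a single linear threshold unit matching each truth table, confirming that AND is linearly separable while XOR is not.

```python
import itertools

# Truth tables: AND is linearly separable, XOR is not
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def separable(table):
    """Brute-force search for w1, w2, b such that the threshold unit
    step(w1*x1 + w2*x2 + b) reproduces the truth table on all four inputs."""
    candidates = [k / 2 for k in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5
    for w1, w2, b in itertools.product(candidates, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(y)
               for (x1, x2), y in table.items()):
            return True
    return False

print(separable(AND))  # True: e.g. w1 = w2 = 1, b = -1.5 draws a separating line
print(separable(XOR))  # False: no single line separates the Xs from the Os
```

The weight grid is a hypothetical choice; since no line separates XOR at all, enlarging the grid would not change the second result.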

From Cognitive Science to Deep Learning

In the 1970s, despite a lack of significant advancements in neural networks, two key trends set the stage for their revival in the 1980s: the rise of cognitivism in psychology and philosophy, and the concept of a paradigm shift in science introduced by Thomas Kuhn. Cognitivism emphasized the mind as a complex system that should be explored independently of the brain, advocating for the development of systems that mimic cognitive processes and behaviors. This approach countered the earlier behaviorism of the 1950s, which treated the mind as a black box, and challenged the dualism between mind and brain prevalent in philosophical studies. The cognitive revolution, marked by Chomsky's universal grammar and his critique of behaviorism, spanned six foundational disciplines (anthropology, computer science, linguistics, neuroscience, philosophy, and psychology), ushering in a new era of cognitive science focused on mutable processes rather than immutable structures.

The Lighthill report, officially titled "Artificial Intelligence: A General Survey" by James Lighthill and presented in 1973, significantly impacted AI funding in the UK, leading the British government to close all but three AI departments. This decision forced many researchers to abandon their projects; the University of Edinburgh hosted one of the few departments that survived. Notably, the report prompted an Edinburgh professor to reference cognitive science for the first time, marking a pivotal moment for the field.

13 A great exposition of the cognitive revolution can be found in [17].

B. F. Skinner introduced scientific rigor to the study of behavior by emphasizing its objective and measurable aspects, transforming what had been a largely speculative field of research.

Christopher Longuet-Higgins, a Fellow of the Royal Society and a formally trained chemist, played a pivotal role in bridging cognitive science and deep learning.

In 1967, while at the University of Edinburgh's Theoretical Psychology Unit, Longuet-Higgins raised critical questions regarding the justification of AI research, emphasizing the need for AI to enhance our understanding of human cognition rather than merely to create machines. Although Lighthill recognized the potential of neural networks, he insisted that they must align with contemporary neuroscience findings and accurately model biological neurons. Longuet-Higgins countered this perspective by noting that, much as computer hardware is only part of a computing system, understanding human cognition requires examining mental processes and their interactions. He argued that these interactions are foundational to cognition, and that AI's true value lies in its ability to model and formalize cognitive processes, rather than merely delivering technological or economic benefits.

Before the turn of the decade, another thing happened, but it went unnoticed.

By then, the community understood how to train single-layer neural networks and recognized that adding a hidden layer would significantly enhance their capabilities. The challenge remained that no one knew how to train neural networks with multiple layers.

In 1975, economist Paul Werbos discovered backpropagation, a method for propagating errors through hidden layers in neural networks. His finding initially went unnoticed; it was rediscovered by David Parker, who published his results in 1985. That same year, Yann LeCun also independently discovered and published backpropagation.

In the 1980s, the cognitive era of deep learning emerged in sunny San Diego, marked by the pivotal discovery of backpropagation by Rumelhart, Hinton, and Williams.

The San Diego circle included notable researchers, among them Geoffrey Hinton, a psychologist and PhD student of Christopher Longuet-Higgins at the Edinburgh AI department. During his studies, Hinton faced criticism from faculty for his focus on neural networks, which he referred to as optimal networks.

After graduating in 1978, Hinton joined the Cognitive Science program at UCSD in San Diego, where the academic environment was more receptive to neural network research. Prominent figures at UCSD included David Rumelhart, a foundational figure in cognitive science who popularized artificial neural networks through the concept of connectionism, which remains influential in the philosophy of mind. Terry Sejnowski, a physicist and computational biology professor, collaborated with Rumelhart and Hinton on key research papers. John Hopfield advanced the field by enhancing the recurrent neural network model now known as the Hopfield network. Linguist Jeffrey Elman, also at UCSD, would later introduce Elman networks, further contributing to the development of neural network theory.

Michael I. Jordan, a prominent psychologist, mathematician, and cognitive scientist, was also a member of the San Diego circle; he introduced Jordan networks, often referred to in contemporary literature as simple recurrent networks.

In the early 1990s, the AI community shifted its focus toward support vector machines (SVMs), which were favored for their mathematical rigor, while neural networks were seen by many as less relevant. Significant advances nevertheless occurred in the late 1990s: the invention of long short-term memory (LSTM) networks by Hochreiter and Schmidhuber in 1997, and the first convolutional neural network, LeNet-5, by LeCun et al. in 1998, both of which laid the groundwork for deep learning. Despite being initially overlooked, these innovations were pivotal, culminating in the 2006 introduction of deep belief networks by Hinton and his colleagues, which marked the rebranding of deep neural networks as deep learning. This ushered in a new era in AI and the emergence of various new architectures, some of which will be discussed in this book, while others are left for the reader to explore independently.

16 The full story about Hinton and his struggles can be found at http://www.chronicle.com/article/The-Believers/190147.


For an exhaustive treatment of the history of neural networks, we point the reader to the paper by Jürgen Schmidhuber [34].

Neural Networks in the General AI Landscape

Neural networks evolved from philosophical logic, influenced by psychology and cognitive science, leading to their resurgence in AI and computer science. A key question is their position within the broader AI landscape, as classified by organizations such as the American Mathematical Society (AMS) and the Association for Computing Machinery (ACM). The AMS's Mathematics Subject Classification 2010 categorizes AI into subfields including general AI, learning systems, pattern recognition, theorem proving, and natural language processing. The ACM classification details subclasses such as knowledge representation, planning, search methodologies, and computer vision. Notably, in the ACM scheme machine learning is a category parallel to AI rather than subordinate to it.

What can be concluded from these two classifications is that there are a few broad fields of AI, inside which all other fields can be subsumed:

Deep learning refers to a specialized class of artificial neural networks, which are themselves a subset of machine learning algorithms. These networks are particularly effective in fields such as natural language processing, computer vision, and robotics. While this definition captures the essence of deep learning, it oversimplifies the concept, as there are deeper complexities involved.

17 See http://www.ams.org/msc/.

18 See http://www.acm.org/about/class/class/2012.

The relationship between the components of AI can be described in terms of vertical and horizontal elements. Good Old-Fashioned AI (GOFAI) is a horizontal component encompassing a broad range of work in knowledge representation and reasoning, while deep learning is another horizontal component aiming to unify various disciplines. Both GOFAI and deep learning seek to address comprehensive AI questions through their own methodologies, each with its strengths. The view of deep learning as a distinct school of thought is examined further in the literature, which refers to it as the 'connectionist tribe'.

Philosophical and Cognitive Aspects

In our exploration of neural networks, we have yet to clarify two key concepts, starting with the term 'cognitive'. In neuroscience, 'cognitive' describes the outward expressions of mental behavior rooted in the cortex. There the definition of these abilities is comparatively well-established, since it is grounded in neural activity. Defined this way, a cognitive process in the realm of AI would likewise be characterized by its neural underpinnings.

19 Knowledge representation and reasoning for GOFAI, machine learning for deep learning.

The transition from cognitive science to deep learning involves mimicking the mental processes of the human cortex. Philosophy, additionally, seeks to move beyond the biological aspects of the brain, aiming to establish definitions in a broader, more generalized context.

A working definition of 'cognitive process' might be: any process taking place in a similar way in the brain and the machine. This definition commits us to defining 'similar way', and if we take artificial neural networks to be a simplified version of the real neuron, this might work for our needs here.

The challenge of modeling cognitive processes highlights a significant issue: while deep learning has made strides in simulating various cognitive functions, reasoning remains elusive. Reasoning, a central focus of philosophical and formal logic, has been foundational to traditional AI approaches. This raises the question of whether deep learning can ever fully capture reasoning, or whether learning is inherently distinct from reasoning, so that reasoning may not be learnable at all. The debate echoes the historical conflict between rationalists, who believe in an innate logical framework, and empiricists, who emphasize knowledge gained through experience.

A formal proof that no machine learning system could learn reasoning, which is considered a distinctly human cognitive process, would have profound technological, philosophical and even theological significance.

The hypothesis that dogs cannot learn relations raises important questions about their cognitive abilities. For instance, when teaching a dog the concept of 'smaller', we can create a training scenario where the dog must choose the smaller of two objects upon hearing the command. This task is complex: the dog must understand that 'smaller' is not simply a label for a single object, but a relational concept that only becomes meaningful when comparing two items. The challenges involved in teaching dogs to grasp relational terms highlight the intricacies of learning relations.

Logic is fundamentally relational; every aspect of it consists of relationships. While relational reasoning can be effectively executed through formal rules, the challenge lies in understanding how to learn the content associated with these relations. Traditionally, this has involved manually defining entities and their relationships, sometimes incorporating dynamic factors to allow for change over time. A significant gap nevertheless persists between patterns on one side and relations on the other.

The literature on animal cognitive abilities is limited, with few academic studies linking animal cognition to ethology. Our research identified a single paper addressing the constraints of dog learning, so we refrain from making definitive claims and opt instead for hypothetical considerations.

Fodor and Pylyshyn's seminal paper highlights a critical philosophical issue for artificial neural networks and connectionism, arguing that thinking and reasoning are fundamentally rule-based and symbolic rather than innate mental faculties. They suggest that reasoning evolved as a complex tool for preserving truth and predicting future events, presenting a challenge to connectionism. For connectionism to achieve reasoning, it would have to create an artificial neural network that generates a system of rules, leading to symbolic reasoning in which neural networks provide the content while the reasoning process itself remains symbolic.

The argument hinges on the assumption that thinking is fundamentally rule-based; challenging this notion suggests that reasoning may involve intuitive processes as well. Connectionist approaches, particularly through models like word2vec, have made significant strides here by demonstrating that words can be clustered by semantic similarity based on their contextual usage. For instance, in a practical reasoning scenario one might think, "It's too long for a walk; I better take my van," only to realize, "I forgot my van is at the mechanic; I'll take my wife's car." This illustrates that valid reasoning can occur without strict syllogistic forms, since it relies on recognizing similarities between concepts, such as equating 'car' with 'van'. Word2vec goes further by allowing calculations with word vectors, enabling analogical reasoning, exemplified by v(king) - v(man) + v(woman) ≈ v(queen), a pivotal advance in connectionist reasoning.
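The analogy arithmetic can be illustrated with a toy sketch. The 2-D vectors below are invented values (real word2vec embeddings are trained and have hundreds of dimensions): one component loosely encodes 'royalty' and the other 'gender', just enough to make the king/queen analogy come out.

```python
import math

# Hypothetical toy embeddings, NOT real word2vec output
v = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# v(king) - v(man) + v(woman) should land nearest v(queen)
target = [k - m + w for k, m, w in zip(v["king"], v["man"], v["woman"])]
best = max(v, key=lambda word: cosine(v[word], target))
print(best)  # queen
```

With these hand-made vectors the result vector coincides exactly with v(queen); with trained embeddings the analogy holds only approximately, hence the ≈ in the text.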

In the final chapter of the book, we will delve into reasoning within the framework of question answering. We will also examine energy-based models and memory models, noting that the most effective current approach to reasoning is memory-based.

Plato characterized thinking as a dialogue of the soul with itself in his Sophist, and our goal is to emulate this idea. In contrast, Aristotle's rule-based reasoning, presented in his Organon, has been the prevailing approach. We therefore aim to reframe reasoning in a Platonic rather than the traditional Aristotelian framework.

22 At this point, we deliberately avoid talking of ‘valid inference’ and use the term ‘valid thinking’.

Interchangeability of vehicles depends on the context; for example, while a car cannot transport a piano, it is as suitable as a van for grocery shopping.

The relationship between memory and reasoning is often viewed as a sharp distinction in traditional cognitive settings, particularly under the influence of Good Old-Fashioned Artificial Intelligence (GOFAI). Neural networks and connectionism challenge this separation, suggesting a more integrated view of these cognitive processes.

References

1. A. M. Turing, On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 42(2), 230–265 (1936)

2. V. Peckhaus, Leibniz's influence on 19th century logic, in The Stanford Encyclopedia of Philosophy, ed. by E. N. Zalta (2014)

3. J. S. Mill, A System of Logic Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation (1843)

4. G. Boole, An Investigation of the Laws of Thought (1854)

5. A. M. Turing, Computing machinery and intelligence. Mind 59(236), 433–460 (1950)

6. R. Carnap, Logical Syntax of Language (Open Court Publishing, 1937)

7. A. N. Whitehead, B. Russell, Principia Mathematica (Cambridge University Press, Cambridge, 1913)

8. J. Y. Lettvin, H. R. Maturana, W. S. McCulloch, W. H. Pitts, What the frog's eye tells the frog's brain. Proc. IRE 47(11), 1940–1959 (1959)

9. N. R. Smalheiser, Walter Pitts. Perspect. Biol. Med. 43(1), 217–226 (2000)

10. A. Gefter, The man who tried to redeem the world with logic. Nautilus 21 (2015)

11. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Spartan Books, Washington, 1962)

12. F. Rosenblatt, Recent work on theoretical models of biological memory, in Computer and Information Sciences II, ed. by J. T. Tou (Academic Press, 1967)

13. S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 3rd edn. (Pearson, London, 2010)

14. H. Moravec, Mind Children: The Future of Robot and Human Intelligence (Harvard University Press, Cambridge, 1988)

15. M. Minsky, S. Papert, Perceptrons: An Introduction to Computational Geometry (MIT Press, Cambridge, 1969)

16. L. R. Graham, Science in Russia and the Soviet Union: A Short History (Cambridge University Press, Cambridge, 2004)

17. S. Pinker, The Blank Slate (Penguin, London, 2003)

18. B. F. Skinner, The Possibility of a Science of Human Behavior (The Free House, New York, 1953)

19. E. L. Gettier, Is justified true belief knowledge? Analysis 23, 121–123 (1963)

20. T. S. Kuhn, The Structure of Scientific Revolutions (University of Chicago Press, Chicago, 1962)

21. N. Chomsky, Aspects of the Theory of Syntax (MIT Press, Cambridge, 1965)

22. N. Chomsky, A review of B. F. Skinner's Verbal Behavior. Language 35(1), 26–58 (1959)

23. A. Newell, J. C. Shaw, H. A. Simon, Elements of a theory of human problem solving. Psychol. Rev. 65(3), 151–166 (1958)

24. J. Lighthill, Artificial intelligence: a general survey, in Artificial Intelligence: A Paper Symposium, Science Research Council (1973)

25. P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences (Harvard University, Cambridge, 1975)

26. D. B. Parker, Learning-logic. Technical Report No. 47 (MIT Center for Computational Research in Economics and Management Science, Cambridge, 1985)

27. Y. LeCun, Une procédure d'apprentissage pour réseau à seuil asymétrique. Proc. Cogn. 85, 599–604 (1985)

28. D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation. Parallel Distrib. Process. 1, 318–362 (1986)

29. J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79(8), 2554–2558 (1982)

30. N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (Cambridge University Press, Cambridge, 2000)

31. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

32. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

33. G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)

34. J. Schmidhuber, Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)

35. P. Domingos, The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World (2015)

36. M. S. Gazzaniga, R. B. Ivry, G. R. Mangun, Cognitive Neuroscience: The Biology of the Mind, 4th edn. (W. W. Norton and Company, New York, 2013)

37. A. Santos, Limitations of prompt-based training. J. Appl. Companion Anim. Behav. 3(1), 51–55 (2009)

38. J. Fodor, Z. Pylyshyn, Connectionism and cognitive architecture: a critical analysis. Cognition 28, 3–71 (1988)

39. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in ICLR Workshop (2013), arXiv:1301.3781

Derivations and Function Minimization

This chapter provides the essential mathematical foundations for the subsequent chapters, focusing primarily on backpropagation, the core mechanism of deep learning. Central to this process is gradient descent, which involves moving along the gradient, a vector of derivatives. The initial section explains derivatives, ensuring that readers grasp the concepts of gradient and gradient descent. This material is not revisited later, but it is used extensively throughout the rest of the book.
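The idea of gradient descent can be previewed with a minimal one-variable sketch. The function f and the learning rate below are illustrative choices, not the book's examples: we repeatedly step against the derivative, so x slides toward the minimum.

```python
def f(x):
    return (x - 3) ** 2        # simple convex function with its minimum at x = 3

def df(x):
    return 2 * (x - 3)         # derivative of f

x = 0.0                        # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * df(x) # move a small step against the gradient

print(round(x, 4))             # close to 3.0, the minimizer of f
```

In later chapters the same update is applied to many weights at once, with the gradient computed by backpropagation instead of by hand.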

We introduce a notational convention: ':=' signifies definition, so 'A := xy' means that we define A to be xy. Sets are fundamental mathematical objects, serving as the building blocks for many other concepts. A set is a collection of members, which can include both other sets and non-sets; the latter are basic elements known as urelements, such as numbers or variables. Sets are typically written with curly braces; for instance, A := {0, 1, {2, 3, 4}} is a set containing three members.

In this notation, {2, 3, 4} is an element of the set A, not a subset; a subset containing it would be written {0, {2, 3, 4}}. Sets can be defined extensionally, by listing their members, such as {-1, 0, 1}, or intensionally, by specifying a property that members must satisfy, like {x | x ∈ Z ∧ |x| < 2}, where Z denotes the set of integers and |x| the absolute value of x. Both definitions describe the same set, illustrating the axiom of extensionality, which asserts that two sets are equal exactly when they contain the same elements. Consequently, the sets {0, 1} and {1, 0} are equal, as are {1, 1, 1, 1, 0} and {0, 0, 1, 0}, since they all consist of the same members, 0 and 1.

1 Notice that they also have the same number of members, or cardinality, namely 2.

© Springer International Publishing AG, part of Springer Nature 2018
S. Skansi, Introduction to Deep Learning, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-319-73004-2_2

A set does not retain the order of its elements or allow repetitions; multisets, or bags, allow repetitions while still ignoring order. For example, the multiset {1,0,1} is equal to {1,1,0}, but neither is equal to {1,0}. To differentiate bags from sets, elements are often counted, so instead of writing {1,1,1,1,0,1,0,0} we write {"1":5,"0":3}. Multisets are particularly useful for modeling language through the bag of words model, discussed in Chapter 3.
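Python's collections.Counter is a ready-made multiset and matches the counted notation from the text; the sentence below is an invented example of the bag of words idea, not taken from Chapter 3.

```python
from collections import Counter

# A bag ignores order but keeps repetitions
bag1 = Counter([1, 0, 1])
bag2 = Counter([1, 1, 0])
print(bag1 == bag2)               # True: same element counts
print(bag1 == Counter([1, 0]))    # False: repetitions matter

# The {"1": 5, "0": 3} notation from the text, as counts
bag3 = Counter([1, 1, 1, 1, 0, 1, 0, 0])
print(bag3[1], bag3[0])           # 5 3

# Bag of words: a sentence becomes a multiset of its words
words = Counter("the cat saw the dog and the cat ran".split())
print(words["the"], words["cat"]) # 3 2
```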

A vector is represented by a sequence of values in which both position and repetitions matter, such as (1,0,0,1,1). When dealing with a vector of variables, denoted (x₁, x₂, ..., xₙ), it is referred to as x. Each individual value xᵢ, where 1 ≤ i ≤ n, is called a component, and the number of components is the dimensionality of the vector x.

Tuples and lists are similar to vectors, primarily used in programming to represent them, with typical variable names like myList or vectorAsTuple For instance, a tuple or list can be defined as newThing:=(11,22,33) The key distinction is that lists are mutable, allowing changes to their elements, while tuples remain immutable This means that if we have newThing:=(11,22,33), we can modify a list element with newThing[1] ←99, but we cannot do the same with a tuple.

Assigning 99 to the second component of a list mutates the list in place, so newThing becomes (11,99,33). A tuple, by contrast, does not allow modification of its components; to get the changed version we must create a new tuple, say newerThing, by copying components: newerThing[0] ← newThing[0], newerThing[1] ← 99, newerThing[2] ← newThing[2]. This gives a quick way to find out whether an unknown data structure is a list or a tuple: try to modify a component. Vectors could be modelled by tuples, but in programming they are usually represented as lists.
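The distinction is easy to demonstrate in Python (the variable names mirror the ones used above):

```python
# A list is mutable: assigning to a component changes the list in place.
my_list = [11, 22, 33]
my_list[1] = 99          # counting starts at 0, so this is the second component
print(my_list)           # [11, 99, 33]

# A tuple is immutable: the analogous assignment raises a TypeError,
# so we build a new tuple from the old one instead.
my_tuple = (11, 22, 33)
try:
    my_tuple[1] = 99
except TypeError:
    newer_thing = (my_tuple[0], 99, my_tuple[2])
print(newer_thing)       # (11, 99, 33)
```

The try/except block also illustrates the probing trick mentioned above: attempting to modify a component tells the two types apart.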

Functions are essential computational tools that transform inputs into outputs, and they require a precise definition of how this transformation happens. Take, for example, the function y = f(x) := 4x³ + 18: x is the input, y is the output, and f is the function's name. The output y is obtained by applying the function f to the input x, written y := f(x). There are further subtleties here, but they are beyond the scope of this discussion.

When we think of a function like this, we actually have an instruction (algorithm) for how to transform the x to get the y, by using simpler functions such as addition,

2 The counting starts with 0, and we will use this convention in the whole book.

3 The traditional definition of a function builds on sets, tuples and relations, but it would be needlessly complicated for our present needs. It would also narrow our view, whereas the broader definition we use here lets many more things count as functions.

multiplication and exponentiation, which can in turn be derived from simpler functions. The interested reader can find detailed proofs of this in [2].

Note that if we have a function with two arguments,⁴ f(x, y) = x^y, and pass in values

(2,3), we get 8, but if we pass in (3,2), we get 9, since 2³ = 8 and 3² = 9. Functions are order sensitive, which is to say that they take vectors as inputs. This lets us generalize: a function always takes a vector as input, and a function taking an n-dimensional vector is called an n-ary function. With this convention, the notation f(x) can be used for any function. A 0-ary function, which produces an output without taking any input, is called a constant, e.g. p() := 3.14159, where the open and closed parentheses signal that nothing goes in.

If we take a function's argument (its input vector) and extend it by the function's output, we get a vector (x₁, x₂, …, xₙ, y); the set of all such vectors forms the graph of the function f. Functions can also include parameters, as in f(x) = ax + b, where a and b are treated as fixed but can be adjusted to make the function perform better. It is crucial to note that a function always produces the same output for the same input as long as the parameters stay the same; changing the parameters changes the outputs, which is particularly relevant in deep learning, a method focused on automatically tuning parameters to improve results.

An indicator function, also known as a characteristic function, assigns a value of 1 to elements that belong to a specific set A and a value of 0 to all other elements This function, denoted as χA in literature, varies for different sets A but consistently performs the same operation We refer to this function as 1A, and it serves as a foundational concept for one-hot encoding, which will be explored in the next chapter.

In mathematics, a function f : A → B has a domain A from which it takes inputs and a codomain B in which its outputs live. A function defined for every element of its domain is called total; otherwise it is partial. A function always assigns the same output to a given input vector. If every element of the codomain is the output of at least one input, the function is a surjection; if distinct input vectors always receive distinct outputs, it is an injection; a function that is both is a bijection. The set of outputs corresponding to a set of inputs is the image, written f[A] = B, while going backwards, asking which inputs produce a given set of outputs, gives the inverse image, written f⁻¹[B] = A.

4 A function with n-arguments is called an n-ary function.

A function is called monotone if it preserves order: for all x and y in its domain, either x < y implies f(x) ≤ f(y) (an increasing function) or x < y implies f(x) ≥ f(y) (a decreasing function). If the inequalities on the outputs are strict (< instead of ≤), the function is strictly increasing or strictly decreasing. A continuous function is, informally, one without gaps; for our purposes this simplified definition will do, trading precision for clarity, with more detail to come later.

Vectors, Matrices and Linear Programming

Before proceeding, it's essential to understand the concept of Euclidean distance In a 2D coordinate system, the distance between two points, p1:=(x1,y1) and p2:=(x2,y2), is defined as d(p1,p2) = √((x1−x2)² + (y1−y2)²) This measure is fundamental to the behavior of the entire space, as it shapes our understanding of spatial relationships By utilizing Euclidean distance, we create Euclidean spaces, which align with our natural spatial intuitions Throughout this book, we will exclusively focus on Euclidean spaces.

In this section, we focus on the development of tools for n-dimensional vectors, represented as \(x = (x_1, x_2, \ldots, x_n)\), where each \(x_i\) is referred to as a component It is common to visualize n-dimensional vectors as points within an n-dimensional space, which will be formally defined as a vector space later on For the moment, our discussion revolves around a collection of n-dimensional vectors from \(R^n\).

A scalar is a single number that can be viewed as a vector in R1, while n-dimensional vectors are sequences of n scalars Multiplying a vector by a scalar is straightforward; for example, multiplying the vector (1, 4, 6) by 3 results in (3, 12, 18) Vector addition is also simple, requiring two vectors, a and b, to have the same number of components The addition of these vectors is performed component-wise, as illustrated by (1, 2, 3) + (4, 5, 6) = (1 + 4, 2 + 5, 3 + 6).
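Both operations can be written in a few lines of pure Python (the function names are ours):

```python
def scalar_multiply(s, a):
    """Multiply each component of the vector a by the scalar s."""
    return [s * a_i for a_i in a]

def vector_add(a, b):
    """Add two vectors of the same dimensionality component-wise."""
    assert len(a) == len(b), "vectors must have the same number of components"
    return [a_i + b_i for a_i, b_i in zip(a, b)]

print(scalar_multiply(3, [1, 4, 6]))     # [3, 12, 18]
print(vector_add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]
```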

To form a vector space, it is essential to work with vectors of the same dimensionality, while also including scalars, which are considered one-dimensional vectors The operations of scalar multiplication and vector addition are fundamental in establishing this vector space.

In exploring the nature of vectors, we focus on the 3D space where these vectors exist, represented as points within this environment It's essential to consider whether a minimal set of vectors can define the entire universe of 3D vectors, a concept that can be easily extended to n-dimensional spaces.

We will work only with vectors over R, avoiding unnecessary complications. Any vector in three-dimensional space can be expressed as s₁e₁ + s₂e₂ + s₃e₃, where e₁ = (1,0,0), e₂ = (0,1,0) and e₃ = (0,0,1), and s₁, s₂ and s₃ are scalars chosen to produce the desired vector; this shows the special role scalars play, a kind of aristocracy among vectors. For instance, to represent the vector (1, 34, −28) we take s₁ = 1, s₂ = 34 and s₃ = −28 and plug them into the linear combination above. In this way every vector in the space can be described as a linear combination of e₁, e₂ and e₃, which together form the standard basis of the three-dimensional vector space R³.

In the context of vector spaces, a basis is defined as a subset B of a vector space V that is both linearly independent and a minimally generating set This means that the vectors in B cannot be expressed as linear combinations of one another, and they are sufficient to generate every vector in V when combined appropriately.

In this article, we focus on the fundamental operation with vectors known as the dot product The dot product, which applies to two vectors of the same dimensions, yields a scalar result It is mathematically defined as the sum of the products of corresponding components, expressed as a · b = a1b1 + a2b2 + + anbn, where a = (a1, , an) and b = (b1, , bn).

For example, the dot product (1,2,3)·(4,5,6) is 1·4 + 2·5 + 3·6 = 32. When the dot product of two vectors equals zero, they are called orthogonal. Vectors also have lengths, measured by the L2 or Euclidean norm: for a vector a = (a₁, …, aₙ), the L2 norm is ||a||₂ = √(a·a) = √(a₁² + ⋯ + aₙ²).

It is important to distinguish between the notation for norms and absolute values In upcoming chapters, we will explore the L2 norm in greater detail To create a normalized vector from any vector \( a \), simply divide it by its L2 norm, resulting in \( \hat{a} = \frac{a}{\|a\|_2} \).
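The dot product, the L2 norm and normalization can be sketched in pure Python as follows (function names are ours):

```python
import math

def dot(a, b):
    """Dot product: sum of products of corresponding components."""
    assert len(a) == len(b)
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def l2_norm(a):
    """Euclidean length of a vector: the square root of a . a."""
    return math.sqrt(dot(a, a))

def normalize(a):
    """Divide a vector by its L2 norm to get a vector of length 1."""
    n = l2_norm(a)
    return [a_i / n for a_i in a]

print(dot([1, 2, 3], [4, 5, 6]))   # 32
print(l2_norm([3, 4]))             # 5.0
print(l2_norm(normalize([3, 4])))  # 1.0
```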

Two vectors are called orthonormal if they are normalized and orthogonal. We will need these concepts in Chaps. 3 and 9. We now turn our attention to matrices.

17 A minimal subset such that a property P holds is a subset (of some larger set) of which we can take no proper subset such that P would still hold.

Matrices are an extension of vectors, structured like tables consisting of rows and columns Understanding matrices can be enhanced by relating them to the concepts previously discussed regarding vectors.

A matrix is characterized by its entries denoted as \(a_{jk}\), where \(j\) represents the row and \(k\) the column Unlike vectors, matrices have two dimensions; for instance, matrix A is a 4×3 matrix, distinct from a 3×4 matrix Conceptually, a matrix can be viewed as a collection of vectors It can be interpreted in two ways: as row vectors \(a_1 = (a_{11}, a_{12}, a_{13}), a_2 = (a_{21}, a_{22}, a_{23}), a_3 = (a_{31}, a_{32}, a_{33}), a_4 = (a_{41}, a_{42}, a_{43})\) stacked into a new vector \(A = (a_1, a_2, a_3, a_4)\), or as column vectors \(a_{x1} = (a_{11}, a_{21}, a_{31}, a_{41}), a_{x2} = (a_{12}, a_{22}, a_{32}, a_{42}), a_{x3} = (a_{13}, a_{23}, a_{33}, a_{43})\) bundled together as \(A = (a_{x1}, a_{x2}, a_{x3})\).

In analyzing vectors, it's essential to differentiate between horizontal and vertical formats A horizontal vector, known as a row vector, is represented as a 1×n dimensional matrix, denoted as h = (a₁, a₂, a₃, , aₙ) Conversely, a vertical vector, referred to as a column vector, is structured as an n×1 dimensional matrix Understanding these distinctions is crucial for clarity in mathematical contexts.

Matrix transposition turns row vectors into column vectors, and an m×n matrix into an n×m matrix, while preserving the order of rows and columns; one can picture it as flipping a matrix printed on a transparent sheet from portrait to landscape orientation. Formally, the transpose of an n×m matrix A, written A⊤, is the matrix B obtained by moving each element a_jk to position b_kj. Transposing a column vector gives a row vector and vice versa. In deep learning, transposition is used constantly to make dimensions match so that operations can proceed efficiently. A square matrix A is called symmetric if A = A⊤. Scalar multiplication for matrices works as for vectors: each entry of A is multiplied by the scalar s, giving

sA = ⎡ s·a11 s·a12 s·a13 ⎤
     ⎢ s·a21 s·a22 s·a23 ⎥
     ⎢ s·a31 s·a32 s·a33 ⎥
     ⎣ s·a41 s·a42 s·a43 ⎦

Multiplying a matrix by a scalar is commutative, sA = As, unlike multiplying a matrix by another matrix, which in general is not. To apply a function f(x) to a matrix A, we apply it to each entry of the matrix individually, and write the result as f(A).
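Representing a matrix as a list of row lists, transposition and scalar multiplication can be sketched in pure Python (function names are ours):

```python
def transpose(A):
    """Rows of A become the columns of the transpose."""
    return [list(column) for column in zip(*A)]

def scalar_matrix_multiply(s, A):
    """Multiply every entry of the matrix A by the scalar s."""
    return [[s * entry for entry in row] for row in A]

A = [[1, 2, 3],
     [4, 5, 6]]                          # a 2x3 matrix
print(transpose(A))                      # [[1, 4], [2, 5], [3, 6]], a 3x2 matrix
print(scalar_matrix_multiply(2, A))      # [[2, 4, 6], [8, 10, 12]]
```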

Probability Distributions

This section delves into essential statistics and probability concepts crucial for deep learning We focus on the specific elements necessary for this field, while also recommending two excellent textbooks for those interested in a deeper understanding.

Statistics serves as the foundation of data analysis by examining a population characterized by specific attributes To illustrate this concept intuitively, envision the population as the residents of a city, with their attributes, such as height, representing the diverse characteristics being analyzed.

A function may possess multiple local minima, yet it has only one global minimum While gradient descent can become trapped in a local minimum, our example features a single local minimum that coincides with the global minimum.

22 We stop simply because we consider it to be ‘good enough’—there is no mathematical reason for stopping here.

23 This book is available online for free at https://www.probabilitycourse.com/

24 Properties are called features in machine learning, while in statistics they are called variables,which can be quite confusing, but it is standard terminology.

Probability distributions are essential in statistics as they analyze various population characteristics, including average height, common occupations, and other attributes like weight, education, and foot size While statistical analysis requires organized and interpretable data, deep learning methods can operate effectively without such constraints.

To find the average height of a population, we take the heights of all inhabitants, add them up and divide by the number of inhabitants:

meanHeight = (height₁ + height₂ + ⋯ + heightₙ) / n

The average height, or mean height, is a numerical value that can be calculated for various features, including weight and body mass index Features with numerical values are referred to as numerical features While the mean represents a numerical middle value, determining the 'middle value' for categorical data, such as occupations within a population, requires using the mode The mode identifies the most frequently occurring value, such as 'analyst.'

Binning is a valuable preprocessing technique that aggregates a numerical feature into a handful of meaningful categories. For instance, when analyzing monthly salaries we might round to the nearest thousand, so that 2345 becomes 2000 and 3987 becomes 4000; without binning, values such as 19.01 and 19.02 would be treated as just as distinct from each other as either is from 19000034. Binning reduces complexity and often gives a clearer picture of the underlying trends.

In addition to the mean and mode, there exists a third perspective on centrality Consider the sequence 1, 2, 5, 6, and 10,000; in this case, the mode is ineffective since there are no repeating values, and binning is not applicable While the mean can be calculated as 2002.8, it provides little useful information about the sequence due to the presence of the outlier, 10,000 Outliers, which can be defined more rigorously later, are atypical values that skew statistical measures It's important to note that outliers are not exclusively large; for example, a value like 0.0001 could also serve as an outlier Understanding outliers is crucial for effective machine learning applications.

To determine a central measure that is not affected by outliers in the sequence 1, 2, 5, 6, 10000, the median is the most effective method For sequences with an odd number of elements, the median is identified as the middle value of the sorted list.

In our case, the median is 5 If we have the sequence 2,1,6,3,7, the median would be the middle element of the sorted sequence 1,2,3,6,7 which is 3 We have noted

25 Note that the mean is equally useless for describing the first four and the last member taken in isolation.

The sequence can be sorted in ascending or descending order; it does not matter which. An odd number of elements is the convenient case for finding the median, but even-numbered sequences can be handled too: sort the sequence, take the two middle elements, and define the median as their mean. For example, sorting 4, 5, 6, 2, 1, 3 gives 1, 2, 3, 4, 5, 6, whose middle elements are 3 and 4, so the median is 3.5. Note that, unlike in the odd case, this median need not be a member of the original sequence, but for most machine learning applications this is not a problem.
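All three measures of central tendency can be computed in a few lines of Python (function names are ours):

```python
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    if len(s) % 2 == 1:                # odd number of elements: the middle one
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # even: mean of the two middle ones

def mode(xs):
    return Counter(xs).most_common(1)[0][0]   # the most frequent value

print(mean([1, 2, 5, 6, 10000]))    # 2002.8 -- skewed by the outlier
print(median([1, 2, 5, 6, 10000]))  # 5
print(median([4, 5, 6, 2, 1, 3]))   # 3.5
print(mode([1, 1, 0, 1, 0]))        # 1
```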

In this article, we shift our focus from measures of central tendency to key concepts such as expected value, bias, variance, and standard deviation, beginning with fundamental probability calculations and distributions Probability, illustrated through the simple experiment of a coin toss, involves understanding all possible outcomes—in this case, heads or tails The calculation of basic probabilities relies on determining the total number of outcomes and the frequency of the desired result For instance, in a coin toss, the probability of landing heads is P(heads) = 1/2 = 0.5 To further clarify, we consider a more complex scenario involving a pair of D6 dice, where we aim to calculate the probability of rolling a five This requires identifying the total outcomes (B) and the successful outcomes (A) for achieving the desired result.

When rolling two six-sided dice, the total number of possible outcomes is 36 Each outcome is determined by the combination of the first die and the second die, with each die having six possible results For example, if the first die shows a 1, there are six possible outcomes for the second die, and this pattern continues for each number up to 6 on the first die Consequently, the probability of rolling a total of 5 is calculated as P(5) = 4/36, which simplifies to approximately 0.11 This method of calculating probabilities involves counting the favorable outcomes against the total possible outcomes.
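We can check this count by brute force, enumerating all 6² outcomes in Python:

```python
from itertools import product

# all 36 equally likely outcomes of rolling two D6 dice
outcomes = list(product(range(1, 7), repeat=2))
# the outcomes whose components sum to 5: (1,4), (2,3), (3,2), (4,1)
favourable = [o for o in outcomes if sum(o) == 5]

print(len(outcomes))                    # 36
print(len(favourable))                  # 4
print(len(favourable) / len(outcomes))  # 0.111..., i.e. 4/36
```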

27 This is the ‘official’ name for the mean, median and mode.

28 Not 5 on one die or the other, but 5 as in when you need to roll a 5 in Monopoly to buy that last street you need to start building houses.

29 In 6², the 6 denotes the number of values on each die, and the 2 denotes the number of dice used.

Probability distributions are calculated by determining the likelihood of a desired outcome occurring, divided by the total number of possible outcomes It's important to note that rolling a 6 on the first die and a 1 on the second die counts as one distinct outcome, while rolling a 1 on the first die and a 6 on the second die counts as another Additionally, there is only one combination that results in a total of 2, which occurs when both dice show a 1.

Probability distributions are essential functions that indicate the frequency of occurrences for different outcomes To understand these distributions, we first need to define a random variable, which is a mapping from a probability space to real numbers, essentially representing a variable that can assume random values, typically denoted as X The values taken by this variable are represented as x1, x2, and so on Importantly, the term 'random' can be replaced by a specific probability distribution that favors certain outcomes over others For example, in a uniform distribution, where there are 10 elements in the probability space, each element has an equal probability of 0.1 This is the simplest form of a probability distribution Another example is the Bernoulli distribution, which models a random variable that can take the value 1 with probability p and the value 0 with probability 1−p; in the case of a fair coin toss, p would be 0.5, though other values of p are also possible.

To continue, we must define theexpected value To build up intuition, we use the two D6 dice example If we have a single D6 die, we have

The expected value of a random variable X under a distribution P is E_P[X] = x₁p₁ + x₂p₂ + ⋯ + x₆p₆, where each outcome xᵢ has probability pᵢ given by P. In this scenario, with six possible outcomes, each outcome has the same probability 1/6, so E_P[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.

When rolling two D6 dice, the complexity of probability increases significantly, leading to a non-uniform distribution of outcomes For instance, the probability of rolling a total of 5 with two D6 dice is not simply 1 in 36, highlighting the intricate nature of dice probabilities.
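Both claims are easy to verify in Python, using exact fractions so no rounding creeps in:

```python
from fractions import Fraction
from itertools import product

# Expected value of a single fair D6: each outcome has probability 1/6.
single = sum(Fraction(x, 6) for x in range(1, 7))
print(single)          # 7/2, i.e. 3.5

# For two D6 dice the distribution of the sum is no longer uniform:
# count how many of the 36 outcomes produce each sum.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1

print(counts[5])       # 4, so P(5) = 4/36, not 1/36
expected_sum = sum(s * Fraction(c, 36) for s, c in counts.items())
print(expected_sum)    # 7
```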

30 What we called here ‘basic probabilities’ are actually called priors in the literature, and we will be referring to them as such in the later chapters.

Logic and Turing Machines

Logic plays a crucial role in the foundation of artificial neural networks, particularly in addressing challenges like the XOR problem While a comprehensive exploration of logic is outside the scope of this book, readers are encouraged to refer to [11] or [12] for excellent introductions This article will provide a brief overview, concentrating on the aspects of logic that are directly relevant to the theory and practice of deep learning.

Logic serves as the foundation of mathematics, and it starts from undefined basic entities called propositions, symbolized by letters such as A, B, C, P and Q; the initial letters usually stand for atomic propositions, while P and Q stand for any proposition, atomic or compound. Compound propositions are built with the logical connectives ∧ (and), ∨ (or), ¬ (not), → (if…then) and ≡ (if and only if). For example, if A and B are propositions, A → (A ∨ ¬B) is also a proposition. All the connectives are binary except negation, which is unary. The next key concept is the truth function: atomic propositions are assigned a value of 0 or 1, and compound propositions derive their truth value from the truth values of their constituents.

In propositional logic, the truth function t evaluates the connectives as follows: t(A∧B) = 1 if and only if both t(A) = 1 and t(B) = 1; t(A∨B) = 1 if and only if at least one of t(A), t(B) is 1; t(A→B) = 0 only when t(A) = 1 and t(B) = 0; t(A≡B) = 1 if and only if t(A) = t(B); and t(¬A) = 1 if and only if t(A) = 0. The exclusive disjunction is defined in terms of these: XOR(A,B) := ¬(A≡B), which is true exactly when A and B differ.
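These truth functions translate directly into Python; a minimal sketch (the function names are ours):

```python
def t_not(a):      return 1 if a == 0 else 0
def t_and(a, b):   return 1 if a == 1 and b == 1 else 0
def t_or(a, b):    return 1 if a == 1 or b == 1 else 0
def t_impl(a, b):  return 0 if a == 1 and b == 0 else 1
def t_equiv(a, b): return 1 if a == b else 0

def t_xor(a, b):
    """XOR(A, B) := not(A equiv B): true exactly when the inputs differ."""
    return t_not(t_equiv(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", t_xor(a, b))   # 1 only for (0,1) and (1,0)
```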

Propositional logic can be modified to include fuzzy logic, which allows truth values to range between 0 and 1, rather than being limited to just 0 or 1 This flexibility enables a more nuanced representation of propositions, accommodating degrees of truth.

So a proposition A, meaning for example 'there is a steep decline', is not limited to being plain true (value 1) but can have a truth value of 0.85, i.e. be 'kinda true'. This captures the graded nature of truth in fuzzy logic. The interplay between fuzzy logic and artificial neural networks is a significant and expanding field of research.

The primary extension of propositional logic involves breaking down propositions into properties, relations, and objects, transforming a simple proposition A into A(x) or A(x,y,z) Here, x, y, and z are referred to as variables, which require a specific set of valid objects known as the domain For instance, A(x,y) might express that 'x is above y,' with its truth value depending on the chosen variables By assigning constants like c and d to represent specific domain members, such as 'lamp' and 'table,' we can assert that A(c,d) is true Additionally, quantifiers such as ∃ (exists) and ∀ (for all) enable us to express the existence of objects with certain properties, like writing ∃xB(x) to indicate that there is at least one blue object in the domain, while ∀ would assert that all objects in the domain are blue This framework allows for the composition of complex sentences in logical expressions.

∃x(∀yA(x,y) ∧ ∃z¬C(x,z)), the principle is the same.

We can also take a quick look at fuzzy first-order logic. Here, we have a predicate

In the context of fuzzy sets, let P(x) represent the property of being fragile, and let c denote a flower pot The statement t(P(c)) = 0.85 indicates that the flower pot is somewhat fragile, belonging to the fuzzy set of fragile items with a membership degree of 0.85.

A Turing machine, introduced by Alan Turing, is a fundamental concept in logic and computer science, serving as the original model of a universal machine It consists of two main components: an infinitely long tape divided into cells and a head that can read, write, or erase symbols on the tape Each cell can contain a dot (•), a separator (#), or be blank (B) This simple device is capable of computing any function that can be expressed algorithmically, meaning that any computable function can be translated into a set of instructions for the Turing machine For example, to compute the addition of 5 and 2, one could formulate specific instructions for the machine to follow.

1 Start by writing the blank on the first cell. Write five dots, the separator and two dots.

2 Return to the first blank.

To process the next symbol, check if it is a dot; if so, remember it and move right until you encounter a blank space, where you will write the dot If the next symbol is a separator, return to the beginning and cease further actions.

4 Return to step 2 of this instruction and start over from there.
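The instruction list above is informal, and a faithful cell-by-cell simulation would be longer than we need. As a minimal sketch (our own shortcut, not the book's exact machine), unary addition on a tape can be done by overwriting the separator with a dot and erasing one dot at the end:

```python
# The tape encodes 5 + 2: a blank, five dots, the separator #, two dots.
tape = list("B.....#..")

head = tape.index("#")
tape[head] = "."        # fill the separator's cell with a dot: 8 dots total
# erase the rightmost dot to compensate, leaving 5 + 2 = 7 dots
last_dot = max(i for i, cell in enumerate(tape) if cell == ".")
tape[last_dot] = "B"

print(tape.count("."))  # 7
```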

Logic gates are fundamental components in digital circuits, representing logical connectives An AND gate requires two inputs and outputs a 1 only when both inputs are 1 In contrast, an XOR gate outputs a 1 when exactly one of its inputs is 1, highlighting the differences in their functionality.

A voting gate is a specialized logic gate that accepts multiple inputs and produces an output of 1 if more than half of the inputs are 1, while outputting 0 if no inputs are active The threshold gate generalizes this concept by introducing a threshold (T), outputting 1 when more than T inputs are 1, and 0 otherwise This model serves as the theoretical foundation for simple artificial neurons, which can be understood as threshold logic gates with equivalent computational capabilities in theoretical computer science.
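A threshold gate takes one line of Python to define (the name threshold_gate is ours):

```python
def threshold_gate(inputs, T):
    """Output 1 if strictly more than T of the inputs are 1, else 0."""
    return 1 if sum(inputs) > T else 0

# An AND gate over two inputs is a threshold gate with T = 1 ...
print(threshold_gate([1, 1], 1))   # 1
print(threshold_gate([1, 0], 1))   # 0

# ... and a voting gate over n inputs is a threshold gate with T = n / 2.
votes = [1, 0, 1, 1, 0]
print(threshold_gate(votes, len(votes) / 2))   # 1, since three of five are 1
```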

Logic gates can be thought of as electrical switches, where a binary 1 signifies that current is flowing and a 0 that it is not. Most gates behave naturally under this interpretation, but some are impractical to realize directly and are instead built by combining other gates. A negation gate, for instance, should output a 1 when it receives a 0 as input, but this eludes our intuitions about current (if you put two switches on the same line and close one, closing the other will not produce a 1). This is a strong case for intuitionistic logic, where the rule ¬¬P → P does not hold.

36 This is not exactly how it behaves, but it is a simplification which is more than enough for our needs.

Writing Python Code

Machine learning today is a process inseparable from computers This means that any algorithm is written in program code, and this means that we must choose a language.

We chose Python as our programming language. A program is fundamentally just text: to create one, you open a text file, write valid code, and save the file with the appropriate extension, .py for Python. What counts as valid code is defined by the programming language in question, but all program code is just text, and can be edited with any text editor.

Programming languages can be either compiled or interpreted, with compiled languages processed through code compilation and interpreted languages requiring an interpreter to execute programs Python is an interpreted language, necessitating an interpreter for execution, while ANSI C is a compiled language For Python, the recommended interpreter is Anaconda, available at www.continuum.io/downloads, with the latest version being Python 3.6 When installing Anaconda, it is advisable to accept all default options, particularly the option to prepend Anaconda to the path, to avoid potential issues like 'dependency hell.' For detailed installation instructions, refer to the Anaconda website.

After installing Anaconda, the next step is to create an Anaconda environment To do this, open your command prompt (Windows) or terminal (OSX, Linux) and enter the command `conda create -n dlBook01 python=3.5`, then press enter This command establishes an environment named `dlBook01` with Python 3.5, which is necessary for TensorFlow To activate the environment, type `activate dlBook01` in the command line and hit enter; your prompt will then update to reflect the active environment The environment will remain active until you deactivate it.

There are numerous text editors available, including Notepad, Vim, Emacs, Sublime, Notepad++, Atom, Nano, and cat, many of which are free to use It's advisable to experiment with different editors to find the one that suits you best Additionally, Integrated Development Environments (IDEs) like Visual Studio, Eclipse, and PyCharm offer enhanced functionalities beyond standard text editors, although most IDEs are not free, with some offering trial versions for testing While IDEs provide added conveniences, it's important to note that anything an IDE can do, a text editor can also accomplish Personally, I prefer using Vim for my editing needs Remember to activate your command prompt by typing "activate dlBook01" each time you open it or restart your computer.

To install TensorFlow, first activate your environment and run the command `pip install --upgrade tensorflow`. If that does not work, try `pip3 install --upgrade tensorflow`. If issues persist, refer to the official TensorFlow website for troubleshooting, check the FAQ section, or seek assistance on Stack Overflow, where you can typically get a response within a few hours. After successfully installing TensorFlow, install Keras by checking the requirements at keras.io/#installation and running `pip install keras`. If the Keras installation fails, consult the Keras documentation or return to Stack Overflow for further help.

After installing everything, open the command line and type "python" to launch the Python interpreter, where you should see 'Python 3.5' and 'Anaconda' displayed If it doesn't work, restart your computer, reactivate the Anaconda environment, and try typing "python" again If the issue persists, seek help on StackOverflow.

To verify your Python installation, open the Python interpreter, which should display a prompt like >>> You can test it by entering simple expressions, such as 2 + 2, which will return 4, and '2' + '2', which will return '22' Next, type import tensorflow and ensure it returns to the prompt without errors; if you encounter an issue, consult StackOverflow for assistance Finally, repeat this process to confirm your Keras installation is successful Once these steps are completed, your installation will be finished.

Each section of this book contains fragments of code that should be saved in individual files, except for the Neural Language Models chapter, where the code from both sections goes in one file. After saving the code as myFile.py, open the command line, navigate to the directory containing the file, activate the dlBook01 environment, and execute `python myFile.py`. This will run the code, display output on the screen, and possibly generate additional files. Note the difference: `python` alone opens the interpreter for interactive coding, while `python myFile.py` executes the specified file directly.


2.6 A Brief Overview of Python Programming

In this section we cover the fundamental data structures and commands in Python, building on the installation of Python, TensorFlow, and Keras discussed earlier. You can put everything we explore here into a single Python file named testing.py. To execute the code, save the file, open a command line in its directory, and type python testing.py. We begin by writing the first line of the file: print("Hello, world!").

This line of code consists of a string, "Hello, world!", and the built-in function print( ). Built-in functions are prepackaged functions that simplify coding, and users can combine them into more complex functions later on. For a comprehensive list and explanation of all built-in functions, visit https://docs.python.org/3/library/functions.html; should the link become outdated, the page is easy to find with a search engine.

Understanding data types is fundamental in Python; the most essential types are strings (str), integers (int), and decimals (float). Strings are sequences of characters, integers are whole numbers, and floats are numbers with decimal points. To experiment, open the interpreter from the command line. For example, the expression "1" == 1 evaluates to False, since it compares a string with an integer. Using != instead of == returns True, indicating that the two values are not equal.

Python does not implicitly convert between integers and strings, but explicit conversion makes the comparison succeed: both `int("1") == 1` and `"1" == str(1)` evaluate to True. Python does, however, compare integers and floats by value, so the expression `1.0 == 1` evaluates to True. Note also that the `+` operator means different things for different data types: for integers and floats it is addition, while for strings it is concatenation, so the expression `"2" + "2" == "22"` also returns True.
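These behaviours are easy to check in the interpreter; the following lines reproduce the comparisons discussed above:

```python
# Comparisons across types, as described in the text.
print("1" == 1)        # False: a string is never equal to an integer
print("1" != 1)        # True
print(int("1") == 1)   # True: explicit conversion makes them comparable
print(1.0 == 1)        # True: ints and floats compare by value
print(2 + 2)           # 4  (addition on integers)
print("2" + "2")       # 22 (concatenation on strings)
```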

In the file testing.py, you can build a more complex function from basic ones. For example, the function `subtract_one(my_variable)` takes an input variable and returns its value decreased by one, so printing the result of `subtract_one(53)` outputs 52.

The code introduces a fundamental concept in Python programming by defining a function named "subtract_one" that accepts a single parameter called "my_variable." The use of the "def" command signifies the beginning of the function, and the colon at the end indicates that additional instructions will follow.

In Python programming, comments are text segments within the code that the interpreter disregards They serve various purposes, including providing notes or alternative code snippets, enhancing code readability and maintainability.

Four consecutive whitespaces indicate an indentation level, which is how Python organizes code into blocks. A single tab can replace the four spaces, but you must be consistent throughout the file; this book uses whitespaces for indentation. After the indentation comes the return statement, which concludes the function and specifies the value to return, in this case my_variable - 1. The parentheses simply make sure Python groups the return value correctly. Comments may also follow the return statement, since the interpreter ignores them.

The third line, which is not indented, lies outside the function definition and directly calls the built-in print function to display the output of our defined function with the given value.

In Python, a function can execute without printing anything; adding a print statement is what displays results on the screen. To see this interaction between print and return, define the function before using it, as a simple copy/paste will not suffice. A block of code consists of the function definition line and its indented lines; while we have only seen function definition blocks so far, other blocks such as for-loops, while-loops, try blocks, and if statements behave similarly. Another fundamental operation is variable assignment, which binds a value to a name with the syntax newVariable = "someString". You can assign any data type to a variable, including strings, floats, integers, lists, and dictionaries, but remember that a variable retains only its most recently assigned value.
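Putting the pieces together, the little program discussed above, as it would appear in testing.py:

```python
# Contents of testing.py, as described in the text.
def subtract_one(my_variable):  # 'def' opens the function definition block
    # this indented line is a comment inside the function body
    return (my_variable - 1)    # the parentheses group the return value

print(subtract_one(53))  # prints 52
```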

Strings can be defined using either single quotes ('') or double quotes (""), but must be closed with the same type of quote they were opened with. An empty string is written '' or "", and it counts as a substring of every string. Experiment in the interpreter with expressions such as "test" in 'testString' and "" in "text" to observe the behaviour. The built-in function len() returns the length of an iterable such as a string, list, or dictionary; note that floats and integers are not iterables in Python.

You can extract substrings from a string by first assigning the string to a variable, or by manipulating the string directly. For example, if you assign the string "abcdef" to a variable called myVar, then myVar[0] returns the first character of the string, 'a'. Remember that Python uses zero-based indexing, so the first element is accessed with the index 0.

38 Never call this an ‘if-loop’, since it is simply wrong.

Python uses zero-based indexing for iterables, so the first element is accessed with index 0, and for a string of length N the valid indices run from 0 to N-1. To retrieve the last character of a string you can use the index -1, which is equivalent to the expression myVar[(len(myVar)-1)]. Indexing also lets you save individual characters to variables: the code thirdLetter = myVar[2] stores the third letter, "c", in the variable. You can also take out substrings. Try typing sub_str = myVar[2:4] or sub_str = myVar[2:-2]; this takes the characters at indices from 2 up to (but not including) 4, or from 2 up to (but not including) the second-to-last. Slicing works for any sequence in Python, including lists (though not dictionaries, which are indexed by keys rather than positions).
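A few interpreter experiments with the string "abcdef" from the example above (note that Python slices exclude the end index, so myVar[2:4] picks out indices 2 and 3):

```python
myVar = "abcdef"
print(myVar[0])      # a -- zero-based indexing
print(myVar[-1])     # f -- same as myVar[len(myVar) - 1]
thirdLetter = myVar[2]   # 'c'
sub_str = myVar[2:4]     # 'cd' -- indices 2 and 3 (index 4 is excluded)
sub_str2 = myVar[2:-2]   # 'cd' -- from index 2 up to the second-to-last
print(len(myVar))    # 6
```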

A list in Python is a versatile data structure that can hold various data types, enclosed in square brackets, such as [1, 2, 3, ["c", [1.123, "something"]], 1, 3, 4]. Lists can contain duplicate values, and the order of elements matters. To add a value, use the append() method, like myList.append(1.234); to create a blank list, initialize it with newList = []. You can also use len() and index notation on lists, just as with strings. Experiment with creating blank lists and adding elements, and consult the official Python documentation or Stack Overflow for other methods like append(). Embrace the learning process: programming may seem challenging at first but becomes enjoyable with practice. Debugging is a crucial part of coding, so do not be discouraged if your code does not work right away; use print() to troubleshoot, and remember that spending time correcting code is a normal part of the programming journey.
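A quick session with the example list from above:

```python
myList = [1, 2, 3, ["c", [1.123, "something"]], 1, 3, 4]
print(len(myList))      # 7 -- the nested list counts as a single element
myList.append(1.234)    # append adds a value to the end
print(myList[-1])       # 1.234
newList = []            # a blank list
newList.append("first")
print(newList)          # ['first']
```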

Lists consist of elements accessed by their indices, which is the standard method of retrieval. A dictionary, by contrast, is a data structure whose elements are accessed through user-defined keys. For example, myDict={"key_1":"value_1", 1:[1,2,3,4,5], 1.11:3.456, 'c':{4:5}} is a dictionary containing four elements, as its length of 4 indicates. Each element of a dictionary consists of a key, which plays a role similar to an index in a list, and a corresponding value.
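And the corresponding experiments with the example dictionary:

```python
myDict = {"key_1": "value_1", 1: [1, 2, 3, 4, 5], 1.11: 3.456, 'c': {4: 5}}
print(len(myDict))        # 4 -- four key/value pairs
print(myDict["key_1"])    # value_1 -- access by key, not by position
print(myDict[1])          # [1, 2, 3, 4, 5]
myDict["new_key"] = 0.5   # adding a new key/value pair
print(len(myDict))        # 5
```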

3.4 A Simple Neural Network: Logistic Regression

Supervised learning comes in two main varieties: classification and regression. Classification is predicting a class, a task we have met with naive Bayes and will revisit throughout this book. Regression is predicting a value; we will not address regression in this text. Instead, we examine logistic regression, which, despite its name, is a classification algorithm rather than a regression one: it is viewed as a regression model in statistics, and the machine learning community has adopted it as a classifier.

13 That is, the assumption that features are conditionally independent given the target.

Regression problems can often be simulated using classification techniques For instance, when determining a value between 0 and 1 rounded to two decimal places, this can be approached as a 100-class classification problem Conversely, in the context of naive Bayes, we can establish a threshold to classify values as either 1 or 0, demonstrating the interchangeability of regression and classification methods in certain scenarios.

Fig 3.5 Schematic view of logistic regression

Logistic regression, introduced by D. R. Cox in 1958, has been extensively researched and is widely used to this day, for two main reasons. First, it gives insight into the relative importance of features, which helps build intuition about a dataset. Second, and more importantly for us, logistic regression is in essence a one-neuron neural network.

Understanding logistic regression is a crucial step towards mastering neural networks and deep learning. As a supervised learning algorithm, it requires the target values to be included in the training set's row vectors. Consider three training cases: x A = (0.2, 0.5, 1, 1), x B = (0.4, 0.01, 0.5, 0), and x C = (0.3, 1.1, 0.8, 0). The number of input neurons equals the number of features in the row vectors, which in this case is three.

Logistic regression is represented schematically in Fig 3.5 and involves two key equations The first equation, \( z = b + w_1 x_1 + w_2 x_2 + w_3 x_3 \), calculates the logit, or weighted sum, while the second equation describes the logistic or sigmoid function: \( y = \sigma(z) = \frac{1}{1 + e^{-z}} \).

After completing the initial analysis, we may engage in feature engineering and implement a different modeling approach This step is crucial, especially when we lack a comprehensive understanding of the data, a common scenario in many industrial applications.

Logistic regression operates with multiple neurons, as each element of the input vector corresponds to an input neuron However, it is considered to have a single neuron in the context of producing one unified output.

17 If the training set consists of n-dimensional row vectors, then there are exactly n − 1 features—the last one is the target or label.

If we join them and tidy up a bit, we have simply \( y = \sigma(b + w_1 x_1 + w_2 x_2 + w_3 x_3) \).

The first equation shows how the logit is calculated from the inputs, which are typically denoted by x in deep learning. The neuron's output is denoted by y, while the logit is denoted by z or occasionally a. It pays to learn these notational conventions of the machine learning community, since they recur throughout the literature.

To calculate the logit in logistic regression, we require weights (w) and a bias (b) in addition to the inputs The only components that are not inputs or constants are the weights and biases, which are classified as parameters The primary objective of logistic regression is to learn an optimal vector of weights and a suitable bias to enhance classification accuracy Essentially, the core learning process in both logistic regression and deep learning revolves around identifying an effective set of weights.
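As a minimal sketch, the logit and the sigmoid can be written out in plain Python (the helper names sigmoid and logistic_output are our own, not from any library; the numbers in the demo call are merely illustrative):

```python
import math

def sigmoid(z):
    # the logistic function: sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def logistic_output(x, w, b):
    # y = sigma(b + w1*x1 + w2*x2 + w3*x3) for one input row x
    z = b + sum(wi * xi for wi, xi in zip(w, x))  # the logit
    return sigmoid(z)

print(logistic_output((0.2, 0.5, 0.91), (0.1, 0.35, 0.7), 0.66))  # ~0.8164
```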

Weights and biases are the essential components here. Weights determine the significance of each input feature, somewhat like percentages, except that they can exceed the range from 0 to 1 and thus act as amplifications. The bias was historically called a threshold: if the weighted sum of the inputs exceeded the threshold, the neuron would output 1, and otherwise 0. The activation function σ(z) replaces this all-or-nothing behaviour with a continuous output between 0 and 1. Later we will see how to treat the bias as part of the weights.

Fig 3.6 Historic and actual neuron activation functions

18 Mathematically, the bias is useful to make an offset called the intercept.

The bias will simply be absorbed into the weights, so we can forget about it, knowing it will be taken care of as one of the weights.

Let us see how logistic regression computes its output, using randomly initialized weights and bias. Suppose the random values, all between 0 and 1, are the weight vector w = (0.1, 0.35, 0.7) and the bias b = 0.66. We assume the input vectors have already been one-hot encoded and normalized: x A = (0.2, 0.5, 0.91, 1), x B = (0.4, 0.01, 0.5, 0), and x C = (0.3, 1.1, 0.8, 0), where the final component of each is the target. Applying the logistic function to the first input gives

y A = σ(0.66 + 0.1·0.2 + 0.35·0.5 + 0.7·0.91) = σ(1.492) = 0.8163

We note the result 0.8163 and the actual label 1. Now we do the same for the second input:

y B = σ(0.66 + 0.1·0.4 + 0.35·0.01 + 0.7·0.5) = σ(1.0535) = 0.7414

Noting again the result 0.7414 and the label 0, we turn to the last input row vector:

y C = σ(0.66 + 0.1·0.3 + 0.35·1.1 + 0.7·0.8) = σ(1.635) = 0.8368

The result is 0.8368 against the label 0. The first input was therefore classified correctly, while the second and third were misclassified. To improve the model we must update the weights based on these errors, and that requires measuring the degree of misclassification with an error function; here we use the sum of squared errors (SSE).

Targets, also known as labels, are the desired outputs of the model. The (n) are indices ranging over training samples, so t(k) denotes the target of the k-th training row vector. This will become clearer shortly.

The sum of squared errors is one of the simplest error functions available, defined as SSE = Σn (t(n) − y(n))², where the sum runs over all training samples; there are other error functions as well. The seemingly odd notation will be simplified later. Let us now calculate our SSE.

We update the weights to w = (0.1, 0.36, 0.3) and the bias to b = 0.25; how this update is computed will be explained through the general weight update rule in Chap. 4. This completes one cycle of weight adjustment, commonly called an epoch, though we will refine this definition in the next chapter. To evaluate the new weights, we recalculate the outputs and the new SSE, starting with y new A = σ(0.25 + 0.1·0.2 + 0.36·0.5 + 0.3·0.91) = σ(0.723) = 0.6733.
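The whole worked example can be checked in a few lines of Python; the code below reproduces the outputs and the SSE for the old and the new weights (the helper names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Feature parts of xA, xB, xC and their targets (the final components).
X = [(0.2, 0.5, 0.91), (0.4, 0.01, 0.5), (0.3, 1.1, 0.8)]
t = [1, 0, 0]

def forward(w, b):
    # output of the one-neuron network for every training row
    return [sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) for x in X]

def sse(y, targets):
    # sum of squared errors over all training samples
    return sum((tn - yn) ** 2 for yn, tn in zip(y, targets))

y_old = forward((0.1, 0.35, 0.7), 0.66)
print([round(v, 4) for v in y_old])   # ~[0.8164, 0.7414, 0.8369]
print(round(sse(y_old, t), 4))        # ~1.2838

y_new = forward((0.1, 0.36, 0.3), 0.25)
print(round(y_new[0], 4))             # ~0.6733
```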

3.5 Introducing the MNIST Dataset

The MNIST dataset, derived from data of the National Institute of Standards and Technology (NIST), consists of handwritten digits and is a staple of machine learning research. Compiled by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges, MNIST is a modified version of NIST's Special Database 1 and Special Database 3. Geoffrey Hinton has called MNIST 'the fruit fly of machine learning', since a great deal of research is done on it and it is well suited to simple tasks. Today the dataset can be obtained from several sources; one of the most convenient is Kaggle, which offers the data in a simple CSV format. An example of an MNIST digit is shown in Fig. 3.7.

MNIST images are 28 by 28 pixels in greyscale, with each pixel value ranging from 0 (white) to 255 (black). This is the opposite of the usual greyscale convention, in which 0 represents black and 255 white. The inverted encoding means the mostly blank background is stored as zeros, which can save some storage space, though this is a relatively minor concern for a dataset of MNIST's size.

One significant limitation discussed in this book is that current supervised machine learning algorithms accept only vector inputs; they cannot directly process matrices, graphs, or trees. Consequently, all data must be converted into n-dimensional vectors. The MNIST images, being 28 by 28 pixel matrices of identical dimensions, can each be transformed into a 784-dimensional vector.

To convert a 28×28 matrix into a 784-dimensional vector, we read the pixels from left to right, moving to the next row after reaching the end of the current one. This straightforward transformation works only because all input samples are of the same size. Graphs and trees would likewise have to be represented in vector form before they could be processed.

We will return to this as an open problem at the very end of this book.
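The left-to-right, row-by-row flattening of an image matrix can be sketched with NumPy (the arange values below are a stand-in for real pixel data, which would come from the CSV file):

```python
import numpy as np

# A dummy 28x28 'image' standing in for one MNIST sample.
image = np.arange(28 * 28).reshape(28, 28)

# Read the pixels left to right, row by row -- the flattening described above.
vector = image.reshape(784)

print(vector.shape)               # (784,) -- a 784-dimensional vector
print(vector[28] == image[1, 0])  # True: index 28 is the second row's first pixel
```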

A natural question is what changes when we move from greyscale to colour: MNIST is greyscale, but many image datasets come as RGB, and an RGB dataset forces a decision about how the three colour channels are fed to the model.

22 See http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec1.pdf.

23 Available at https://www.kaggle.com/c/digit-recognizer/data.

24 The interested reader may look up the details in Chap 4 of [10].

In digital imaging, a colour image has three channels, red, green, and blue, each holding values from 0 to 255. Each channel can be rendered on its own as a greyscale image with no loss of information: the 'red' channel's pixel values measure intensity, not colour, with 0 black and 255 white. Separating the red component from an RGB pixel such as (34, 67, 234) therefore yields the greyscale value 34.

To display that value as actual 'redness' on an RGB screen, it would have to be represented as (34, 0, 0), keeping the image in RGB format; the same holds for green and blue. When processing RGB images, there are several options for handling the three channels:

• Average the components to produce an average greyscale representation (this is the usual way to create greyscale images from RGB).

• Separate the channels into three distinct datasets and train three individual classifiers; at prediction time, the final outcome is the average of the three classifiers' results. This is an example of a committee of classifiers.

• Separate the channels in distinct images, shuffle them and train a single classifier on all of them. This approach would be essentially dataset augmentation.

• Separate the channels into distinct images and train three instances of the same classifier (identical size and parameters), one per channel, then use a fourth classifier to make the final decision from their outputs. This approach is, in a sense, the core idea behind convolutional neural networks, which we will examine in detail in Chap. 6.

Each approach has its advantages and drawbacks, making different options suitable for different problems, so it always pays to explore several of them.
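As an illustration of the first option, averaging the channels of a tiny, made-up RGB image with NumPy:

```python
import numpy as np

# A made-up 2x2 RGB image; each pixel is a (red, green, blue) triple.
rgb = np.array([[[34, 67, 234], [0, 0, 0]],
                [[255, 255, 255], [90, 120, 30]]], dtype=np.float64)

grey = rgb.mean(axis=2)   # average the three channels per pixel
print(grey)               # [[111.67, 0.], [255., 80.]] (rounded)

red = rgb[:, :, 0]        # the red channel alone is a greyscale image
print(red[0, 0])          # 34.0 -- the 'redness' of the pixel (34, 67, 234)
```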

3.6 Learning Without Labels: K-Means

In this section and the next we look at two key unsupervised learning algorithms: K-means and principal component analysis (PCA). PCA is only briefly introduced here, with a detailed discussion to follow in Chap. 9; it is fundamental to deep learning as a basic method for building distributed representations. Clustering is conceptually simpler: it groups data points according to their similarity in n-dimensional space. K-means, the simplest clustering algorithm, will serve as our illustration of how clustering works.

Unsupervised learning refers to learning without labeled data or specific targets, which distinguishes it from supervised and reinforcement learning. This broad definition raises intriguing questions about the nature of learning without feedback and whether such learning can be considered genuine; exploring unsupervised learning shades into cognitive modeling, making it a fascinating and vibrant field of study.

K-means is a clustering algorithm that groups similar data points into distinct clusters, which we may label '1', '2', '3', and so on. Imagine data points with two features, living in a 2D plane. K-means operates in an unsupervised setting: all data points are 'training' data, none of them labelled, and the algorithm builds the clusters, that is, the hyperplanes separating the groups, from the features of the input data alone.

The K-means algorithm begins by asking the user to specify the number of centroids, each of which defines a cluster. Initially, the centroids are placed at random positions in the data points' vector space. The algorithm then alternates between two phases, the 'assign' phase and the 'minimize' phase. In the assign phase, each data point is assigned to its nearest centroid by Euclidean distance. In the minimize phase, each centroid is moved so as to minimize the total distance to all the data points assigned to it. This cycle repeats until convergence.

25 But PCA itself is not that simple to understand.

26 K-means (also called the Lloyd-Forgy algorithm) was first proposed independently by S. P. Lloyd in [16] and E. W. Forgy in [17].

27 Usually after a predefined number of iterations; there are other stopping tactics as well.

Envision a centroid anchored to its datapoints by rubber bands; when released, it shifts to minimize the overall tension of the bands, despite some individual bands potentially becoming tighter.

Fig. 3.9 shows two complete cycles of the K-means algorithm. After the minimize phase, the centroids stay where they are, and a new assign phase begins, possibly associating data points differently than in the previous cycle. Once the cycles are finished, a hyperplane is established, and a new data point can be labelled by assigning it to its nearest centroid.
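The assign/minimize cycle can be sketched in a few lines of plain Python (a bare-bones illustration, not a production implementation; real code would also test for convergence rather than run a fixed number of iterations):

```python
import random

def kmeans(points, k, iterations=10):
    """Minimal sketch of the two-phase K-means cycle described above.
    points: list of coordinate tuples; k: number of centroids."""
    centroids = random.sample(points, k)  # random initial positions
    for _ in range(iterations):
        # 'Assign' phase: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # 'Minimize' phase: move each centroid to the mean of its points,
        # which minimizes the total (squared) distance to them.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(coords) / len(cluster)
                                     for coords in zip(*cluster))
    return centroids
```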

In unsupervised learning, clustering typically operates without labels, making traditional evaluation metrics ineffective, as we cannot determine true positives, false positives, true negatives, or false negatives However, there are instances when labels may be available, or true labels might be obtained later, allowing us to assess clustering results using classification metrics This approach is known as external evaluation of clustering, and a comprehensive discussion on applying classification evaluation metrics for this purpose can be found in [11].

When labels are unavailable, we can use internal evaluation metrics for clustering. The best known of these is the Dunn coefficient, which measures how compact and well separated the clusters are in n-dimensional space. The Dunn coefficient is calculated for each cluster to assess its quality.

29 Recall that a cluster in K-means is a region around a centroid separated by the hyperplane.

The Dunn coefficient measures the quality of a clustering. Let d(i, j) be the Euclidean distance between centroids i and j, and let d_in(C) be the intra-cluster distance of cluster C, i.e. the largest distance between any two points in C. The Dunn coefficient of cluster i is then the smallest d(i, j) over all j ≠ i, divided by the largest d_in(C) over all clusters. By calculating the coefficient for each cluster, one can compare different clusterings by averaging the coefficients over all the clusters involved.
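A sketch of the standard, point-based form of the Dunn coefficient, the smallest between-cluster distance divided by the largest within-cluster distance (the per-cluster variant described above works with the analogous centroid-based quantities):

```python
import itertools
import math

def dunn(clusters):
    # smallest distance between points lying in different clusters ...
    inter = min(math.dist(p, q)
                for ci, cj in itertools.combinations(clusters, 2)
                for p in ci for q in cj)
    # ... divided by the largest distance within any single cluster
    intra = max(math.dist(p, q)
                for c in clusters
                for p, q in itertools.combinations(c, 2))
    return inter / intra

print(dunn([[(0, 0), (0, 1)], [(5, 0), (5, 1)]]))  # 5.0
```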

3.7 Learning Different Representations: PCA

The data we have used so far has local representations. If the value of a feature named 'Height' is 180, then that piece of information about that datapoint is stored directly, and only, in that feature. But knowing a person's height gives an epistemic shortcut to their weight: if someone is 180 cm tall, we may reasonably guess they weigh around 80 kg. This is correlation, and highly correlated features are in a sense redundant and hard to tell apart. Ideally, we would transform the data to obtain features that are unique and uncorrelated: a new feature, call it 'Argh', could capture what height and weight have in common, while 'Haght' and 'Waght' would capture what remains of 'Height' and 'Weight' after 'Argh' was removed from them. Such representations are called distributed representations.

Building distributed representations by hand is hard, but it is fundamental to how artificial neural networks function: each layer builds its own distributed representation, and this is what drives the learning process. This idea lies at the heart of deep learning. Here we introduce a simple method for building meaningful distributed representations, principal component analysis (PCA); it is a comparatively simple way to build them, and deep learning, as we shall see, offers a far more powerful one. The mathematical details will follow in Chap. 9, and in this chapter we give only an overview.

30 We have to use the same number of centroids in both clusterings for this to work.

31 These features are known as latent variables in statistics.

We will see the details in Chap. 9. PCA has the following form:

In the equation Z = XQ (3.17), Z is the transformed matrix, X is the input matrix, and Q is the matrix that performs the transformation. If X is an n×d input matrix and the resulting Z must also be n×d, then Q must be a d×d matrix for the multiplication to be well defined. In Chap. 9 we will see how to find a suitable Q; this section gives the intuition behind PCA, outlines what is needed to build Q, and explains what PCA is for and when to use it.

PCA, or Principal Component Analysis, is an essential technique for data preprocessing, transforming data to enhance its suitability for classifiers By building distributed representations, PCA effectively reduces correlation within the dataset, making the data more manageable and interpretable.

PCA can effectively reduce dimensionality by addressing the challenges posed by one-hot encoding and manual feature engineering When creating distributed representations with artificial features like 'Argh', 'Haght', and 'Waght', it's crucial to rank these features based on their informativeness, allowing us to eliminate those that provide little value Informativity is defined by variance; features with higher variance convey more information Therefore, our goal is to organize the feature matrix \(Z\) so that the feature with the highest variance occupies the first column, followed by the second highest in the next column, and so forth.

In Figure 3.10, we demonstrate how variance can be altered through simple transformations using six 2D datapoints Part A shows the initial position, where the variance along the x-axis is minimal, indicating that the datapoints are closely clustered In contrast, the y-axis exhibits greater variance, with the y-coordinates more spread out By rotating the coordinate system, as illustrated in Part B, we enhance the representation of the data while keeping the actual datapoints unchanged This adjustment reflects a different basis for the points in the 2D vector space rather than a modification of the points themselves The mathematical approach to achieving this involves finding a matrix Q, which will be explored further in Chapter 9 Additionally, the distance between the first and last datapoint coordinates along the axes serves as a graphical representation of variance.

32 One of the reasons for this is that we have not yet developed all the tools we need to write out the details now.

34 And if a feature is always the same, it has a variance of 0 and it carries no information useful for drawing the hyperplane.

Fig 3.10 Variance under rotation of the coordinate system

Fig. 3.10 compares the original black coordinate system with the transformed grey one. The variance along the y-axis, already the higher of the two in the original system, has increased, while the variance along the x-axis, originally the lower, has decreased.

PCA also helps with noise, which we define as any information that is not relevant. In a well-structured dataset with enough training samples, noise and relevant information are intertwined within the features. With the distributed representation PCA builds, we can separate features by variance: noise, being random, generally has low variance, while relevant information has high variance. If we apply PCA to a 20-dimensional input matrix and keep only the first 10 new features, we discard mostly noise, since the low-variance features carry little relevant information, and the integrity of the dataset is largely preserved.
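Although the construction of Q is explained only in Chap. 9, the computation can be sketched with NumPy: Q's columns are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue, i.e. by decreasing variance (the toy data and the function name pca_transform are ours):

```python
import numpy as np

def pca_transform(X, keep):
    """Sketch of Z = XQ: the first column of Z has the highest variance,
    the second the next highest, and so on; keep the first `keep` columns."""
    Xc = X - X.mean(axis=0)                  # centre the data first
    cov = np.cov(Xc, rowvar=False)           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # highest variance first
    Q = eigvecs[:, order]                    # the d x d transformation matrix
    Z = Xc @ Q
    return Z[:, :keep]

# Perfectly correlated toy data: the second new feature carries no variance.
X = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])
Z = pca_transform(X, 1)
print(Z.shape)  # (4, 1): only the high-variance feature is kept
```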

PCA has a rich history: it was first introduced by Karl Pearson of University College London in 1901, and it has been rediscovered and adapted many times since. The nuances of and relationships among the PCA variants are intriguing, but too complex to explore fully here.

3.8 Learning Language: The Bag of Words Representation

So far we have dealt with numerical, ordinal, and categorical features, including the one-hot encoding of categorical data, but we have not yet touched natural language processing (NLP). For a comprehensive introduction to NLP, see [14] or [15]. In this section we process language with one of the simplest models: the bag of words.

In natural language processing, a corpus is the collection of texts we work with, divided into fragments such as sentences, paragraphs, or whole documents. Each fragment is a training sample: in clinical analysis a patient admission document might be one fragment, in academic research a 200-page PhD thesis might be one, and in sentiment analysis on social media each user comment is a fragment. The bag of words model turns each word in the corpus into a feature and counts its occurrences in each fragment; the order of the words is lost in the process.

The bag of words model is the basic technique for transforming language into features usable by machine learning algorithms. Deep learning offers viable alternatives, discussed in later chapters, while most other machine learning methods rely on the bag of words or a variation of it; the model remains effective for many language processing tasks even in deep learning contexts. To illustrate its application, consider a simple social media dataset.

We need to convert the column ‘Comment’ into a bag of words; the remaining columns will be dealt with afterwards.

To generate a bag of words from the comments, we make two passes over the data. The first pass gathers all distinct words to create the features; the second pass fills those features with their values, the per-fragment counts.
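The two passes can be sketched in plain Python; the comments below are made up for illustration, echoing the table that follows.

```python
# Hypothetical fragments (one comment = one training sample)
comments = ["you dont know", "as if", "i dont know what you know"]

# Pass 1: collect the vocabulary (one feature per distinct word)
vocab = sorted({word for c in comments for word in c.split()})

# Pass 2: count occurrences of each feature in each fragment
bow = [[c.split().count(word) for word in vocab] for c in comments]

print(vocab)   # ['as', 'dont', 'i', 'if', 'know', 'what', 'you']
print(bow[2])  # [0, 1, 1, 0, 2, 1, 1]  -- 'know' occurs twice
```

Note that word order is gone: only the counts survive, which is exactly the trade-off the bag of words makes.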

An expansion of the basic bag of words model is the bag of n-grams, whose features are n-tuples of consecutive words. For instance, in the sentence ‘I will go now’, the 2-grams are {(‘I’, ‘will’), (‘will’, ‘go’), (‘go’, ‘now’)}.
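A sketch of extracting n-grams from a tokenized sentence:

```python
def ngrams(tokens, n):
    # all consecutive n-tuples of words
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I will go now".split(), 2))
# [('I', 'will'), ('will', 'go'), ('go', 'now')]
```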

For effective language processing, particularly with data from social media, it's essential to convert all text to lowercase and remove commas, apostrophes, and non-alphanumeric characters.
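A minimal normalization sketch using a regular expression; the exact cleanup rules (what to strip, what to keep) are a design choice, not prescribed here.

```python
import re

def normalize(text):
    # lowercase, then keep only lowercase letters, digits and spaces
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

print(normalize("You DON'T know!"))  # you dont know
```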

The resulting table has the columns: User, you, dont, know, as, if, i, what, Likes.

To prepare the dataset for a machine learning algorithm, we first create a bag of words from the ‘Comment’ column and then one-hot encode the ‘User’ column; together these give the final input matrix.

After one-hot encoding the users, the table has the columns: S, A, F, F, P, H, you, dont, know, as, if, i, what, Likes.

This example highlights the key differences between one-hot encoding and the bag of words. With one-hot encoding, each row contains a single 1 and the rest 0s, so a row can be represented compactly by just the column index of its 1. The bag of words, by contrast, counts the occurrences of each word in each fragment, and the set of feature columns must be kept consistent across the training and test sets. This causes trouble when a word appears in the test set but not in the training set: since most classifiers demand a fixed dimensionality and fixed feature names, words not present in the trained model simply have to be discarded at prediction time.

Both techniques significantly increase the dimensionality of the data and produce a sparse encoding, in which most feature values are zero. This suggests that many features are irrelevant, and we want our classifier to weed them out efficiently; we will see how methods like PCA and L1 regularization handle sparsely encoded datasets. Note the contrast: here we use a dimensional expansion to capture semantics through word counts.

1. R. Tibshirani, T. Hastie, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. (Springer, New York, 2016)

2. F. van Harmelen, V. Lifschitz, B. Porter, Handbook of Knowledge Representation (Elsevier Science, New York, 2008)

3. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, 1998)

4. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986)

5. M.E. Maron, Automatic indexing: an experimental inquiry. J. ACM 8(3), 404–417 (1961)

6. D.R. Cox, The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B (Methodol.) 20(2), 215–242 (1958)

7. P.J. Grother, NIST special database 19: handprinted forms and characters database (1995)

8. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

9. M.A. Nielsen, Neural Networks and Deep Learning (Determination Press, 2015)

10. P.N. Klein, Coding the Matrix (Newtonian Press, London, 2013)

11. I. Färber, S. Günnemann, H.P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek, On using class-labels in evaluation of clusterings, in MultiClust: Discovering, Summarizing, and Using Multiple Clusterings, ed. by X.Z. Fern, I. Davidson, J. Dy (ACM SIGKDD, 2010)

12. J. Dunn, Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)

13. K. Pearson, On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(11), 559–572 (1901)

14. C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, Cambridge, 1999)

15. D. Jurafsky, J. Martin, Speech and Language Processing (Prentice Hall, New Jersey, 2008)

16. S.P. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

17. E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21(3), 768–769 (1965)

Basic Concepts and Terminology for Neural Networks

Backpropagation is the fundamental learning method in deep learning, and deep learning means deep artificial neural networks. Before diving into backpropagation, we need the basic concepts and how they interact. Deep learning deals with the problems that arise when layers are added to shallow neural networks, i.e. simple feedforward neural networks (convolutional neural networks are also feedforward, but they are not shallow). For further reading we recommend several notable books: [1] gives a detailed mathematical treatment, [2] focuses on applications and related techniques such as the Adaline, [3] is written by the leading experts in deep learning and is a natural next step after this volume, and [4] is a hard read, best tackled after [3]. These should complement the material presented here.

Any neural network is made of simple basic elements. In the last chapter, we encountered a simple neural network without even knowing it: the logistic regression.

A shallow artificial neural network has two or three layers; networks with more layers are called deep. Like logistic regression, an artificial neural network has an input layer that holds the inputs, and each element of the input is called a ‘neuron’. In logistic regression there is a single point where all the inputs converge.

S Skansi, Introduction to Deep Learning, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-319-73004-2_4

A neural network has an output layer, which may contain multiple neurons, unlike logistic regression, which has a single output. Between the input and output layers there may be a ‘hidden’ layer, allowing more complex computation. This structure can be viewed as several logistic regressions in parallel whose outputs feed a final neuron that integrates them, or, alternatively, as a logistic regression with an extra layer of neurons inserted between the inputs and the original output neuron. Both perspectives are valuable for understanding neural networks, and we will switch between them in this chapter as clarity demands.

In a simple three-layer neural network, every neuron in one layer is connected to every neuron in the next layer. Each connection carries a weight that determines how much of a neuron's output reaches the neurons of the next layer. The weights are specific to each pair of neurons: the weight between neuron N5 and neuron M7 is different from the weight between neuron N5 and neuron M3. Two weights may happen to be equal by chance, but in general they differ across connections.

Information flows from the input layer, here consisting of three neurons (x1, x2, x3), to the hidden layer and finally to the output neurons. The only job of the input layer is to accept the input values of these variables; it is the initial stage of the network's processing.

The number of input neurons determines the maximum number of input values the network accepts. It is permissible to have fewer input values than neurons (the unused neurons receive zero), but not more. The inputs can be represented as a sequence (x1, x2, …, xn) or as a column vector x := (x1, x2, …, xn)⊤. The two representations are interchangeable, and the choice between them is purely a matter of computational convenience, whichever makes the operations faster and simpler.

Each neuron of the input layer is connected to every neuron of the hidden layer, while neurons within the same layer remain unconnected. Every connection between a neuron j in one layer and a neuron m in the next layer has a weight, denoted w_jm, which controls how much of the initial value is passed on to the receiving neuron.

If a neuron has an output value of 12 and the weight on the connection to the destination neuron is 0.25, the destination neuron receives the value 3. Weights can both diminish and amplify a value: they are not restricted to the range between 0 and 1.

Consider neuron 3 of layer 2 in Fig. 4.1. Its input is the sum of the products of the outputs of the previous layer and the corresponding weights: the inputs are x1, x2 and x3, with weights w13, w23 and w33. Each neuron also has a modifiable value of its own, the bias b3, which is added to the sum of the weighted inputs. The result of this calculation is called the logit, traditionally denoted z; in this instance it is z23.

Most neurons apply a nonlinear activation function, generically denoted S(x), to transform the logit into the final output, denoted y (y23 for our neuron). The most common choice is the sigmoid or logistic function, which we already met in logistic regression; it converts the logit z into an output via σ(z) = 1/(1 + e^(−z)). It ‘squashes’ everything it receives to a value between 0 and 1, and the intuitive interpretation of its output is the probability of the output given the input.
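The logistic function can be written directly from the formula:

```python
import math

def sigmoid(z):
    # logistic function: squashes any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5  -- the midpoint
print(0.0 < sigmoid(-10.0) < sigmoid(10.0) < 1.0)  # True
```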

Different layers may use different nonlinearities, but all neurons within one layer apply the same nonlinearity to their logits. A neuron sends the same output value in every direction it transmits: in the zoomed-in part of Fig. 4.1, the value y23 is sent along two connections, and it is the same value on both. As Fig. 4.1 also shows, the logits in the next layer are computed in the same way; for example, z31 is obtained by multiplying the outputs of the previous layer by the corresponding weights and adding the bias b31.

1 These models are called linear neurons.

Some neurons do not apply a nonlinearity to the logit; to keep the notation uniform for such linear neurons we simply define y23 := z23. The logit z32 is computed analogously, and by applying the chosen nonlinearity to z31 and z32 we obtain the final output.

Representing Network Components with Vectors

Let us recall the general shape of an m×n matrix (m is the number of rows and n the number of columns):

Suppose we need to define with matrix operations the process sketched in Fig.4.2.

In Chapter 3 we used matrix operations for the logistic regression calculation, and the same principles apply to simple feedforward neural networks. The input can be written as a column vector, e.g. x = (x1, x2)⊤. Figure 4.2 shows the intermediate values in the network, so every step of the calculation can be followed. As before, for a matrix A the entry in the j-th row and k-th column is denoted A_jk; to interchange j and k we use the transpose A⊤, for which A_jk = A⊤_kj. When working with vectors and matrices in neural networks, we try to minimize transpositions, since they cost computation, while keeping the bookkeeping simple and clear. Matrix transposition is not overly costly, so intuitive representations are often worth it, particularly when writing down a weight matrix connecting two layers.

The connections between two layers are stored in a matrix, and an individual weight such as w23 connects the second neuron of one layer with the third neuron of the next. The index tells us which neurons are connected, while which layers are involved is effectively stored in the matrix's name in the program code, e.g. input_to_hidden_w. A matrix can thus be referred to either by its

‘mathematical name’, e.g. u, or by its ‘code name’, e.g. hidden_to_output_w.

So, following Fig. 4.2, we write the weight matrix connecting the two layers as:

w11 (= 0.1)   w12 (= 0.2)   w13 (= 0.3)
w21 (= 1)     w22 (= 2)     w23 (= 3)

Let us call this matrix w (we can add subscripts or superscripts to its name). Using the matrix multiplication w⊤x we get a 3×1 matrix, namely the column vector z = (21, 42, 63)⊤.
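A minimal sketch of this computation, assuming (consistently with z = (21, 42, 63)⊤) that the inputs in Fig. 4.2 are x = (10, 20):

```python
# Weights from Fig. 4.2: row j holds the weights leaving input neuron j
w = [[0.1, 0.2, 0.3],
     [1.0, 2.0, 3.0]]
x = [10.0, 20.0]

# z_k is the k-th entry of w^T x: sum over j of w[j][k] * x[j]
z = [sum(w[j][k] * x[j] for j in range(len(x))) for k in range(3)]
print(z)  # approximately [21.0, 42.0, 63.0]
```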

The forward pass of a neural network is the sequence of calculations performed as the input travels through the network, each layer computing its function. If we denote the input vector by x, the output vector by y, and the functions calculated by the layers by f_i, f_h and f_o, the whole network computes y = f_o(f_h(f_i(x))). This view of a neural network as a composition of functions will be crucial for understanding weight correction via backpropagation.
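The composition view can be sketched directly; the weights of this little 2-3-1 network are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(w, b, x):
    # one fully connected layer: logit z_k = b_k + sum_j w[j][k]*x[j],
    # followed by the nonlinearity
    return [sigmoid(b[k] + sum(w[j][k] * x[j] for j in range(len(x))))
            for k in range(len(b))]

# Hypothetical weights for a 2-3-1 network
w_h = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]
b_h = [0.0, 0.0, 0.0]
w_o = [[0.5], [-0.5], [0.25]]
b_o = [0.1]

x = [1.0, -1.0]
# The forward pass is literally function composition: y = f_o(f_h(x))
y = dense(w_o, b_o, dense(w_h, b_h, x))
print(y)
```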

For a full specification of a neural network we need:

• The number of layers in a network

• The size of the input (recall that this is the same as the number of neurons in the input layer)

• The number of neurons in the hidden layer

• The number of neurons in the output layer

Neurons are represented as entries in a matrix, so the counts above are what we need to define the matrices. The crucial components of a neural network are the weights and biases, since the whole point of training is to adjust them via backpropagation: we measure the classification error the network makes and modify the weights so as to reduce it. The following sections are devoted to backpropagation, the cornerstone of deep learning, and we introduce it gradually, with multiple examples.

The Perceptron Rule

An artificial neuron learns by updating its weights and biases during training via backpropagation; classification uses only the forward pass. One of the earliest learning procedures for artificial neurons is perceptron learning. The perceptron is a binary threshold neuron, which can be thought of as a modified logistic regression, formally defined by: z = b + Σ_i w_i x_i, with output y = 1 if z ≥ 0 and y = 0 otherwise.

Here the x_i are the inputs, the w_i the weights, b is the bias and z is the logit. The decision would usually be made by a nonlinearity, but here a binary step function is used instead. Note also that the bias can be treated as one of the weights if we introduce an additional input x0 whose value is always 1: z = b + Σ_i w_i x_i can then be rewritten as z = w0 x0 + w1 x1 + w2 x2 + ⋯, which shows that the bias behaves just like a weight.

In this equation, b can play the role of either x0 or w0, with the other set to 1. Since we want the bias to be adjustable through learning while the inputs stay fixed, we take the bias to be a weight, w0, with a constant input x0 = 1. This procedure is called bias absorption.
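Bias absorption can be checked numerically; the values are taken from the worked example that follows.

```python
# Explicit bias
w = [2.0, -3.0]
b = 0.5
x = [0.3, 0.4]
z_with_bias = b + sum(wi * xi for wi, xi in zip(w, x))

# Absorbed bias: b becomes w0, paired with a constant input x0 = 1
w_absorbed = [b] + w
x_absorbed = [1.0] + x
z_absorbed = sum(wi * xi for wi, xi in zip(w_absorbed, x_absorbed))

print(z_with_bias, z_absorbed)  # both approximately -0.1
```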

The perceptron is trained as follows (this is the perceptron learning rule 3 ):

2 If the predicted output matches the output label, do nothing.

3 If the perceptron predicts a 0 and it should have predicted a 1, add the input vector to the weight vector.

4 If the perceptron predicts a 1 and it should have predicted a 0, subtract the input vector from the weight vector.

As an example, take the input vector to be x = (0.3, 0.4)⊤, let the bias be b = 0.5, the weights w = (2, −3)⊤ and the target 4 t = 1. We start by calculating the current classification result: z = b + Σ_i w_i x_i = 0.5 + 2·0.3 + (−3)·0.4 = −0.1

When the perceptron's output is 0 instead of the expected 1, it indicates the need to apply clause (3) of the perceptron rule, which involves adding the input vector to the weight vector.
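The whole worked example, with the bias absorbed as w0, can be run as follows:

```python
# Worked example: x = (0.3, 0.4), b = 0.5, w = (2, -3), target t = 1.
w = [0.5, 2.0, -3.0]   # [b, w1, w2] after bias absorption
x = [1.0, 0.3, 0.4]    # [x0, x1, x2] with constant x0 = 1
t = 1

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z >= 0 else 0

y = predict(w, x)          # z = -0.1, so the perceptron predicts 0
if y == 0 and t == 1:      # clause (3): add the input vector
    w = [wi + xi for wi, xi in zip(w, x)]
elif y == 1 and t == 0:    # clause (4): subtract the input vector
    w = [wi - xi for wi, xi in zip(w, x)]

print(w)            # [1.5, 2.3, -2.6]
print(predict(w, x))  # 1 -- the example is now classified correctly
```

After the update z = 1.5 + 2.3·0.3 + (−2.6)·0.4 = 1.15 ≥ 0, so the perceptron now outputs 1 on this example.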

The perceptron algorithm has an important limitation, which manifests when handcrafted features are unavailable; Minsky and Papert pointed this out in 1969. Every classification problem can be seen as a query over data, asking for the inputs that satisfy a certain property, and machine learning exists to capture properties too complex to define directly in terms of the numerical attributes of the input data. If we have a dataset of people's heights and weights, we can filter for people taller than 175 cm with a simple query such as select * from table where cm > 175. But if all we have are jpg files of mugshots with a black-and-white meter in the background, we need a classifier, since the heights are encoded in pixels rather than given as numbers.

What a machine learning algorithm regards as ‘similar’ depends on the representation it is given, and numerical and visual representations differ. Represented as pixels against the meter in the background, heights of 155 cm and 175 cm can look quite unlike 165 cm because of the differing background patterns, even though the numbers behave predictably. Conversely, the digits 6 and 9, which are visually similar (one is roughly a rotation of the other), are not numerically close at all. Similarity, in short, is a property of the representation.

3 Formally speaking, all units using the perceptron rule should be called perceptrons, not just binary threshold units.

4 The target, also called the expected value or the true label, is usually denoted by t.

If the data are represented as pixels and one image is a rotation of another, an algorithm working on pixels may treat them as identical even though they differ numerically. During classification, the algorithm partitions the data points, assigning some the label 1 and the rest the label 0, and this partition is supposed to reflect the underlying reality: the points labelled 1 should truly be ‘ones’ and those labelled 0 truly ‘zeros’.

Parity is a classic concept from logic and theoretical computer science: given binary strings, it labels with 1 exactly those that contain an even number of ones. Parity can be specified for strings of a given length n, written parity_n(x1, x2, …, xn), where each xi is a binary digit. Parity for two digits is exclusive disjunction, XOR, the logical function that returns 1 precisely when one input is 1 and the other is 0. We may also freely exchange the resulting 0s and 1s, since the labels merely name the two classes; this does not change the problem.

The perceptron cannot learn to classify XOR, or any other instance of parity, because there is no way to set the weights and the bias so as to separate the inputs labelled 1 from those labelled 0. Take two input neurons, one for each input bit, and consider the mapping (0,0)→1, (1,1)→1, (1,0)→0 and (0,1)→0. Formally, for weights w1, w2 and bias b, all of the following would have to hold:

The inequality (a) holds since if (x1 = 1, x2 = 1)→1, and we can get 1 as an output only if w1x1 + w2x2 = w1·1 + w2·1 = w1 + w2 is greater or equal b, which means w1 + w2 ≥ b.

The inequality (b) holds since if (x1 = 0, x2 = 0)→1, and we can get 1 as an output only if w1x1 + w2x2 = w1·0 + w2·0 = 0 is greater or equal b, which means 0 ≥ b.

The inequality (c) holds since if (1,0)→0, then w1x1 + w2x2 = w1·1 + w2·

0 = w1, and for the perceptron to give 0, w1 has to be less than the bias b, i.e. w1 < b. By the same argument, (0,1)→0 gives w2 < b. These requirements are jointly unsatisfiable: w1 < b and w2 < b give w1 + w2 < 2b, and since (b) says 0 ≥ b we have 2b ≤ b, so w1 + w2 < b, which contradicts (a).

Inputs longer than L_final are clipped to L_final.

To ensure that all inputs have the specified final length, zeros are added to the right side of each input. This padding preserves the more recent information: the authors reverse the sequences, so the latest data is retained while the older, less relevant information at the beginning is what gets clipped.

To create a Keras-friendly dataset, convert your M by L_final matrices into a tensor: stack the 1000 M by L_final matrices along a new third dimension, obtaining a 1000 by M by L_final tensor. Initialize it as a 3D Numpy array of zeros, then write a function that places a 1 at the appropriate positions. Implement the architecture in Keras code, and if you get stuck, do not hesitate to ask on StackOverflow.
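The tensor construction can be sketched in plain Python with tiny hypothetical sizes instead of 1000 × M × L_final; with Numpy one would allocate the zeros with np.zeros((n_samples, M, L_final)) instead of nested lists.

```python
# Tiny hypothetical sizes: 3 samples, alphabet of 5 symbols, length 7
n_samples, M, L_final = 3, 5, 7

# A samples x M x L_final tensor of zeros
tensor = [[[0] * L_final for _ in range(M)] for _ in range(n_samples)]

def set_one(tensor, sample, char_idx, position):
    # place a 1 for symbol char_idx occurring at this position
    tensor[sample][char_idx][position] = 1

set_one(tensor, 0, 2, 4)
print(tensor[0][2][4], tensor[0][2][3])  # 1 0
```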

If you're new to this task, it may take you up to a week to achieve the desired outcome, despite the simplicity of the code involved Engaging in this exercise is an excellent opportunity to deepen your understanding of deep learning, so be sure to embrace it.

1. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

12 A couple of hours each day—not a literal week.

2. D.H. Hubel, T.N. Wiesel, Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195(1), 215–243 (1968)

3. X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, in Advances in Neural Information Processing Systems 28, NIPS (2015)

Sequences of Unequal Length

Feedforward neural networks handle vectors, and convolutional neural networks handle matrices, which can be converted into vectors. Processing sequences of varying lengths, however, presents a new challenge. When images come in different sizes, the usual solution is simply to rescale them to a uniform size.

When dealing with images of different sizes, such as an 800 by 600 pixel image and a 1600 by 1200 pixel image, resizing is straightforward. To shrink the larger image we can average groups of four pixels, or use max-pooling; to enlarge the smaller one we interpolate pixels. If the images do not scale evenly, a little adjusting is needed to make the rescaling come out right.
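Halving an image's width and height by averaging each 2 by 2 block of pixels can be sketched as:

```python
def downscale(img):
    # average every 2x2 block into one pixel
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

img = [[0, 4, 8, 8],
       [4, 8, 8, 8],
       [1, 1, 2, 2],
       [1, 1, 2, 2]]
print(downscale(img))  # [[4.0, 8.0], [1.0, 2.0]]
```

Max-pooling would simply replace the average by max over the same four pixels.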

When we resize images, the resulting deformations usually do not affect image processing, since the essential shapes are preserved. They can matter, however: a neural network classifying ellipses versus circles would suffer, since rescaling can turn a circle into an ellipse. Note also that if all the matrices we analyse are the same size, we can unfold them into long vectors, as we did in the MNIST section; with varying sizes the resulting vectors have different lengths, and rows of unequal length cannot form a proper input matrix.

To feed images to a neural network, we convert them into fixed-dimensional vectors: a 20 by 20 image becomes a 400-dimensional vector, where, for example, the second pixel of the third row becomes the 42nd component. But if one image is 20 by 20 and another 30 by 30, the first becomes a 400-dimensional vector and the second a 900-dimensional one, and vectors of different dimensions cannot be fed to a neural network whose input dimension is fixed.
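Unfolding an image row by row can be sketched as follows; each ‘pixel’ here is just its own coordinate pair, to make the bookkeeping visible.

```python
rows, cols = 20, 20
# Label every pixel with its (row, col) position, 0-based
img = [[(r, c) for c in range(cols)] for r in range(rows)]

# Row-major flattening into a 400-dimensional vector
flat = [pixel for row in img for pixel in row]

# 1-based: the 2nd pixel of the 3rd row is component (3-1)*20 + 2 = 42
print(len(flat))  # 400
print(flat[41])   # (2, 1) -- row 3, column 2 in 0-based indexing
```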


Audio clips exhibit the same varying-dimensionality problem: they inherently differ in duration, so learning from them means learning from sequences of unequal length. One might pad every clip with silence to the length of the longest, but this wastes storage, and silence itself carries meaning: adding silence to a sound clip can alter its meaning, so an existing label may no longer be appropriate. Think of contexts where irony and sarcasm are present, and how much a pause can change.

To handle such data we need a new architecture: recurrent neural networks. Unlike feedforward neural networks, which only push information forward, recurrent neural networks have feedback loops that feed outputs back in as inputs, and this is what lets them process sequences of varying lengths. The loops also mean that weights are shared across steps, which helps against the vanishing gradient even though the unfolded network is very deep. Historically, the idea of the multi-layer perceptron was present as soon as the perceptron proved insufficient, but it remained a theoretical option until backpropagation arrived in 1986; before that breakthrough, researchers experimented both with adding layers and with adding feedback loops, and the latter laid the groundwork for recurrent neural networks.

The first successful recurrent neural networks predate backpropagation: these were the Hopfield networks, introduced by J.J. Hopfield. They differ from contemporary recurrent neural networks, whose most significant representative is the long short-term memory network (LSTM), invented by Hochreiter and Schmidhuber in 1997. LSTMs dominate the field to this day, achieving state-of-the-art results in applications ranging from speech recognition to machine translation. This chapter builds up all the concepts needed to understand LSTMs thoroughly.

The Three Settings of Learning with Recurrent Neural Networks

Recall from Chapter 3 how the naive Bayes classifier computes P(target|features): it first determines P(feature1|target),

P(feature2|target), etc., from the dataset. This is how the naive Bayes classifier works, but all classifiers (supervised learning algorithms) try to calculate

P(target|features), or P(t|x), in some way. Recall that any predicate P such that


(i) P(A) ≥ 0, (ii) P(Ω) = 1, where Ω is the possibility space, and (iii) for all pairwise disjoint

A_n, n ∈ N, P(∪_{n=1}^{∞} A_n) = Σ_{n=1}^{∞} P(A_n), is a probability predicate. Moreover, it is the probability predicate (try to work out why by yourself).

From a probabilistic viewpoint, then, every supervised machine learning algorithm estimates the probability P(t|x), where x is the input vector and t the target. This is the standard setting of supervised learning with labelled data, and recurrent neural networks can work in it too: given many labelled sequences, they learn to predict the label of each new, completed sequence.

In this setting a recurrent network might, for example, classify audio clips by the emotion expressed. But RNNs can also handle a more complex task involving multiple labels. Suppose we want to train an industrial robotic arm, fitted with many sensors, to follow the directions North, South, East and West. The training set consists of sequences of sensor readings interleaved with moves, such as ‘x1 N x2 N x3 W x4 E x5 W x6 W’ or simply ‘x1 N x2 W’; the peculiarity of this data is that sensor vectors and directional moves alternate within a single sequence.

The sequences must not be broken apart: doing so would lose information essential for prediction. The Markov assumption says that the next state depends only on the current state, but recurrent neural networks are powerful precisely because they do not make this assumption; they can model much more intricate behaviour, learning from sequences of uneven length whose parts are labelled, and producing multiple labels when predicting over unknown vectors. This is called the sequential setting.

The predict-next setting is a version of the sequential setting, used mainly in natural language processing, in which there are no explicit labels. Instead we use implicit ones, obtained by breaking each input sequence into subsequences and taking the subsequent word as the target. To implement this, special tokens for the start and end of a sentence, written $ (‘start’) and & (‘end’), are added manually. The sentence ‘All I want for Christmas is you’ becomes ‘$ all I want for Christmas is you &’, from which the network gets input–target pairs such as:

1 In the general machine learning literature, ŷ usually denotes the results of a predictor, while y is reserved for the target values. Here we adopt a notation more common in deep learning: y denotes the outputs of the predictor, and t the true values, i.e. the targets.

2 Notice which capital letters we kept and try to conclude why.

• (‘$ all I want for Christmas’, ‘is’)

• (‘$ all I want for Christmas is’, ‘you’)

• (‘$ all I want for Christmas is you’, ‘&’).
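Generating all the (prefix, next-word) pairs for a sentence can be sketched as:

```python
# Sentence with the manually added start ($) and end (&) tokens
tokens = "$ all I want for Christmas is you &".split()

# Every prefix paired with the word that follows it
pairs = [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

print(pairs[0])   # ('$', 'all')
print(pairs[-1])  # ('$ all I want for Christmas is you', '&')
```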

The recurrent network thus learns to predict the most probable next word after a given sequence of words. In effect it models a probability distribution over the input data, i.e. P(x), and this makes it unsupervised learning: it does not rely on predefined targets, since the targets are generated from the input sequences themselves.

Notice that we can now feed the network a string of words and ask for the likely continuation, which is interesting in its own right: it is a building block for question answering and, in the spirit of the Turing test, a small step in the direction of general AI. One crucial adjustment is needed, however: if the recurrent network always returned the single most likely next word, it would keep repeating the same answers.

Now, the recurrent neural network would conclude that P(Marcus) = 0.6,

P(Myron) = 0.2 and P(Cassidy) = 0.2, and if it always chose the most probable word it would answer ‘Marcus’ every single time after the sequence ‘My name is’. The point, however, is for the RNN to model a probability distribution over all possibilities for each input sequence, and then to sample from that distribution rather than always take the highest-probability word: ‘Marcus’ should then be produced about 60% of the time, and the other names the rest of the time.

‘Myron’ would then be produced about 20% of the time and ‘Cassidy’ the remaining 20%, so the network avoids giving the identical answer to the identical sequence of words every time. With this foundation in place, we can look at the actual mechanics of recurrent neural networks.
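Concretely, sampling from the distribution (rather than taking the argmax) can be sketched with the standard library; the distribution is the hypothetical one from above.

```python
import random

random.seed(0)  # for reproducibility

# The network's distribution over next words after 'My name is'
names = ["Marcus", "Myron", "Cassidy"]
probs = [0.6, 0.2, 0.2]

# Sample instead of always returning the most probable word
samples = random.choices(names, weights=probs, k=1000)
frac_marcus = samples.count("Marcus") / 1000
print(round(frac_marcus, 2))  # roughly 0.6
```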

Adding Feedback Loops and Unfolding a Neural Network


Recall the vanishing gradient problem: as layers are added to a network, the weight updates produced by gradient descent become vanishingly small, and learning stalls. Convolutional neural networks counter this with shared weights, but their architecture is tailored to images; recurrent neural networks also exploit weight sharing, in a form suited to sequential data such as time series or natural language. This is why RNNs are the tool of choice for sequence-based applications, just as CNNs are for image processing.

Recurrent neural networks extend a simple feedforward neural network by adding recurrent connections in the hidden layer; compare Fig. 7.1a (a simple feedforward network with layers I, H and O) with Fig. 7.1b, where the hidden layer now produces outputs H1, H2, H3, H4, H5, and so on. Note the difference between a hidden layer producing multiple outputs and a network having multiple hidden layers: a layer is defined by its set of weights, and here all the Hn share the same weights. Figure 7.1c simplifies the picture further by collapsing the individual neurons into vectors. Adding the recurrent connections requires one new set of weights, wh, and this is all it takes to introduce recurrence into the network.

To depict the recurrent connections clearly, a recurrent neural network can be unfolded through time. Figure 7.2a shows the original network, Fig. 7.2b shows it unfolded, and Fig. 7.2c shows the same unfolded network with a few further modifications.

Fig 7.1 Adding recurrent connections to a simple feedforward neural network

From this point on we use the detailed notation of Fig. 7.2c, which is the standard notation in the recurrent neural network literature, and with it we can discuss how recurrent neural networks actually work.

7.4 Elman Networks

In Fig. 7.2c we can see the roles of the input weights (w_x), the weights on the recurrent connections (w_h) and the hidden-to-output weights (w_o). The inputs x and outputs y are sequential: x(1) is the first input, x(2) the second, and so on, and the same holds for the outputs. In a classical setting, only x(1) would count as the input and only y(4) as the overall output, but in sequential and predict-next settings all the x's and y's are used. The h's are the hidden states, and they are what distinguishes this model from a simple feedforward network: each h serves as an input to the recurrent connection at the next time step. To get the process started, h(0) is created by setting all of its entries to zero.

Let us now work through a complete example calculation, which shows how all the elements fit together and offers more insight than fragmented calculations would. We denote the nonlinearity by f; you can think of it as the logistic function.

A bit later we will meet a new nonlinearity called softmax, which can be used here and is a natural fit for recurrent neural networks. So, the recurrent neural network

3 We used the shades of grey just to visually denote the gradual transition to the proper notation.

calculates the output y at the final time t. The calculation can be unfolded into the following recursive structure (which makes it clear why we need h(0)):

y(t) = f(w_o^⊤ h(t)) =                                                      (7.1)
     = f(w_o^⊤ f(w_h^⊤ h(t−1) + w_x^⊤ x(t))) =                              (7.2)
     = f(w_o^⊤ f(w_h^⊤ f(w_h^⊤ h(t−2) + w_x^⊤ x(t−1)) + w_x^⊤ x(t))) =      (7.3)
     = … (and so on, all the way down to h(0))                              (7.4)

The equations governing this recurrent neural network can be condensed into two formulas:

h(t) = f_h(w_h^⊤ h(t−1) + w_x^⊤ x(t))     (7.5)
y(t) = f_o(w_o^⊤ h(t))                    (7.6)

Here f_h is the nonlinearity of the hidden layer and f_o the nonlinearity of the output layer; they can be different functions, but they may also be the same if desired. Networks of this form are called Elman networks, after the linguist and cognitive scientist Jeffrey L. Elman.

If we change the h(t−1) to y(t−1) in Eq. 7.5, so that it becomes:

h(t) = f_h(w_h^⊤ y(t−1) + w_x^⊤ x(t))     (7.7)

we obtain a Jordan network, named after the psychologist and cognitive scientist Michael I. Jordan. Elman and Jordan networks are jointly known as simple recurrent networks (SRNs). Although SRNs are rarely used in modern applications, they remain an essential teaching tool for understanding more complex recurrent architectures such as LSTMs. Historically, SRNs were a major breakthrough in language processing: they allowed a model to operate directly on sequences of words, without external representations such as bag-of-words or n-grams, which bear little resemblance to how humans process language. Treating language as a sequence of words was a pivotal moment for artificial intelligence; what was once deemed impossible became achievable. LSTMs have since displaced SRNs in practice thanks to their superior capabilities, at the price of longer training times.
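Equations 7.5 and 7.6 can be run directly as a short NumPy sketch. The layer sizes and random weights below are arbitrary choices for illustration; swapping h(t−1) for the previous y (Eq. 7.7) would turn the Elman step into a Jordan step.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: 4-dimensional inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(1)
w_x = rng.normal(size=(4, 3))  # input-to-hidden weights
w_h = rng.normal(size=(3, 3))  # hidden-to-hidden (recurrent) weights
w_o = rng.normal(size=(3, 2))  # hidden-to-output weights

def elman_step(x_t, h_prev):
    # Eq. 7.5: h(t) = f_h(w_h^T h(t-1) + w_x^T x(t))
    h_t = logistic(w_h.T @ h_prev + w_x.T @ x_t)
    # Eq. 7.6: y(t) = f_o(w_o^T h(t))
    y_t = logistic(w_o.T @ h_t)
    return h_t, y_t

h = np.zeros(3)               # h(0): all entries set to zero
xs = rng.normal(size=(5, 4))  # a sequence of five inputs
ys = []
for x in xs:
    h, y = elman_step(x, h)
    ys.append(y)
```

Here both nonlinearities are the logistic function, so every output component lies strictly between 0 and 1.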

7.5 Long Short-Term Memory

In this section we give a graphical representation of long short-term memory networks (LSTMs); with the explanations and images here, you should be able to code an LSTM from scratch. The illustrations are adapted from Christopher Olah's blog,4 keeping a similar notation with minor adjustments. For clarity we omit the weights in the first few figures and bring them back when discussing the individual LSTM components. Note that in this chapter y(t) is taken to be the same as h(t); wherever h(t) is multiplied by w_o to obtain y(t), we will say so explicitly.

Figure 7.3 shows an SRN unit and an LSTM unit side by side. The main difference is that an SRN unit passes a single connection between time steps, carrying h(t), while an LSTM unit passes two: h(t) and a new component, C(t). C(t) is called the cell state, and it is the main conduit of information through an LSTM; it is what puts the 'long-term memory' in

'LSTM', i.e. it is the long-term memory of the model. Everything else that happens is just different filters to decide what should be kept in or added to the cell state. The

Fig 7.3 SRN and LSTM units zoomed

4 http://colah.github.io/posts/2015-08-Understanding-LSTMs/, accessed 2017-03-22.

Figure 7.4 shows the cell state (a), the forget gate (b), the input gate (c) and the output gate (d). Let us start with the cell state, depicted in Fig. 7.4a. For the moment, the f(t) and i(t) shown in the image can be set aside; how they are calculated will be explained in the following paragraphs.

The LSTM adds information to and removes information from the cell state through gates, and these gates form the core of the unit. A gate is a straightforward combination of addition, multiplication and nonlinearities, where the nonlinearities serve to compress, or 'squash', the information. The logistic or sigmoid function (denoted as SIGM in the images) is used to 'squash' the information to values between

0 and 1, and the hyperbolic tangent (denoted as TANH in the images) is used to

'squash' the information to values between −1 and 1. You can think of it in the following way: SIGM makes a fuzzy 'yes'/'no' decision, while TANH makes a fuzzy

'negative'/'neutral'/'positive' decision. They do nothing else except this.

The first component is the forget gate, illustrated in Fig. 7.4b; the name is an analogy with logic gates in digital circuits. Denoted f(t), it is defined as f(t) := σ(w_f · (x(t) + h(t−1))), where σ is the logistic function. It controls how much of the weighted raw input and the weighted previous hidden state will be remembered.

There are several ways to think about the weights here, but the most natural one is to break them into distinct sets: w_f, w_ff, w_C and w_fff. Other presentations retain terminology from simpler models, but deep learning architectures are best understood as compositions of building blocks, each with its own weights.

Note that in LSTMs the weight w_f plays the role that w_x plays in SRNs; it is not a part of the former w_h. Think of neural networks as a collection of basic 'building blocks' that behave like LEGO® bricks, each with its own set of weights. During training, all the weights in the network are optimized together by backpropagation, joining the blocks into a coherent whole, much as LEGO bricks click together into a complete design.

The next component is the input gate, illustrated in Fig. 7.4c; it decides what to add to the cell state. It contains a second copy of the forget gate, denoted ff(t), which has its own set of weights, together with a module that produces candidate values to be written to the cell state. The ff(t) acts as a 'save' filter, regulating how much of the input will be kept in the cell state.

The candidates, denoted C*(t), are computed as C*(t) := τ(w_C · (x(t) + h(t−1))), where τ is the hyperbolic tangent, which squashes its result to values between −1 and 1. The negative values, spanning from −1 to 0, act as a mechanism for quick 'negations', useful for processing linguistic antonyms and opposites.

As we have seen before, an LSTM unit has three outputs: C(t), y(t) and h(t). We now have all we need to compute the current cell state C(t) (this calculation is shown in Fig. 7.4a):

C(t) = f(t) ⊙ C(t−1) + ff(t) ⊙ C*(t)

where ⊙ denotes componentwise multiplication.

What remains is to compute h(t), and from it y(t) = g_o(w_o · h(t)). For this we need a third copy of the forget gate, denoted fff(t), which decides which parts of the inputs to write to h(t) and to what extent. It is given by fff(t) := σ(w_fff · (x(t) + h(t−1))).

Now, the only thing left for a complete output gate (whose result is actually not an o(t) but h(t)) is to multiply fff(t) by the current cell state squashed to between −1 and 1 by the hyperbolic tangent:

h(t) = fff(t) ⊙ τ(C(t))

This completes the LSTM unit. The fff(t) acts as a 'focus' mechanism, selecting the most important parts of the cell state. The f(t), ff(t) and fff(t) each play a distinct role: the idea is that f(t) enables memory retention, ff(t) controls how much of the input is stored, and fff(t) concentrates on particular components of the cell state. But these roles are a hope built into the design, not a guaranteed outcome.

All an LSTM really has is its particular sequence of calculations and flow of information; any interpretation in terms of 'remembering', 'saving' and 'focusing' is metaphorical. Whether these mechanisms align with anything in the human brain is a matter of luck rather than design, and highly unlikely.
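The whole unit can be sketched in NumPy. We follow this chapter's simplified notation, in which x(t) and h(t−1) are added together (so they must have the same dimension); the dimension, sequence length and random weights below are purely illustrative.

```python
import numpy as np

def sigm(z):  # SIGM: fuzzy 'yes'/'no', squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

tanh = np.tanh  # TANH: fuzzy 'negative'/'neutral'/'positive', squashes to (-1, 1)

d = 3  # x(t) and h(t-1) are added, so both are taken d-dimensional here
rng = np.random.default_rng(2)
w_f = rng.normal(size=(d, d))    # forget gate weights
w_ff = rng.normal(size=(d, d))   # input ('save') gate weights
w_C = rng.normal(size=(d, d))    # candidate weights
w_fff = rng.normal(size=(d, d))  # output ('focus') gate weights

def lstm_step(x_t, h_prev, C_prev):
    s = x_t + h_prev
    f_t = sigm(w_f @ s)                   # what to remember
    ff_t = sigm(w_ff @ s)                 # how much input to save
    C_cand = tanh(w_C @ s)                # candidate values C*(t)
    C_t = f_t * C_prev + ff_t * C_cand    # new cell state
    fff_t = sigm(w_fff @ s)               # where to focus
    h_t = fff_t * tanh(C_t)               # new hidden state (= y(t) here)
    return h_t, C_t

h, C = np.zeros(d), np.zeros(d)          # h(0) and C(0) start at zero
for x in rng.normal(size=(4, d)):        # a sequence of four inputs
    h, C = lstm_step(x, h, C)
```

Because h(t) is a product of a sigmoid and a tanh, its components stay within (−1, 1), while the cell state C(t) is free to grow, which is precisely what lets it carry long-term information.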

LSTMs were first proposed by Hochreiter and Schmidhuber in 1997.

Recurrent neural networks are today among the most important deep learning architectures, used in natural language processing, time series analysis and beyond. To the reader who wants to specialize in these remarkable architectures, we warmly recommend the reference book [5].

7.6 Using a Recurrent Neural Network for Predicting Following Words

In this section we give a practical, minimal example of a simple recurrent neural network used to predict the next words in a sequence of text. This task is versatile: predicting the next word can also be used for question answering, where the answer is simply the word that follows the question. Our example is adapted from an earlier source, with detailed comments and explanations added for clarity. The code is functional Python 3, but you must install the dependencies to run it. Understanding the concepts from this chapter is helpful, but having the actual code on your computer will make the finer details click. We begin with the imports:

from keras.layers import Dense, Activation
from keras.layers.recurrent import SimpleRNN
from keras.models import Sequential
import numpy as np

The next thing is to define the hyperparameters:

hidden_neurons = 50
my_optimizer = "sgd"
batch_size = 60
error_function = "mean_squared_error"
output_nonlinearity = "softmax"
cycles = 5
epochs_per_cycle = 3
context = 3

You can obtain the code from the book's GitHub repository, or type it manually into a single text file and rename it so that it has a .py extension.

Let us comment on these variables. The variable hidden_neurons specifies how many hidden units we will use; these are Elman units, i.e. this is the number of feedback loops in the hidden layer. The variable my_optimizer selects the Keras optimizer, here stochastic gradient descent ("sgd" in Keras); we encourage you to experiment with other optimizers.7 The variable batch_size determines how many examples are processed in each iteration of stochastic gradient descent.

= "mean_squared_error"tells Keras to use the MSE we have been using before.

The line output_nonlinearity = "softmax" introduces a new activation function, the softmax function ("softmax" in Keras), defined as:

ζ(z)_j := e^(z_j) / Σ_{k=1}^{K} e^(z_k),  for j = 1, …, K
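A minimal NumPy sketch of the softmax follows; the max-subtraction is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

z = np.array([2.0, 1.0, -1.0])
p = softmax(z)  # components in (0, 1) that sum to 1

# With two classes, softmax(z1, z2) reduces to logistic(z1 - z2):
p2 = softmax(np.array([3.0, 0.0]))
```

The first component of p2 equals logistic(3.0), illustrating how the binary case collapses to the logistic function.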

The softmax function takes a vector of arbitrary real values and transforms it into a vector of values between 0 and 1 that sum to 1, which makes it the standard choice for the final layer of a deep neural network in multiclass classification. In the binary case, with a two-component vector, the softmax reduces to the logistic function. The next part of the SRN code is a function that reads a text file and turns its contents into a list of words; we will comment on its parameters when they become active later in the code.

This part of the code opens a plain text file, tesla.txt, which will be used for training and predicting. This file should be encoded in utf-8, or the utf-8 in the code should be changed to the file's actual encoding.

7 There is a full list on https://keras.io/optimizers/.

When there are more than two classes, we cannot do what we did in binary classification, where we used the logistic function to produce a probability score for class A and obtained the score for class B as 1 minus the score for class A.

Note that most modern text editors distinguish the file's actual encoding from the encoding used to display it. This way of reading a file works for files up to roughly 70% of the available RAM: a machine with 16 GB of RAM can comfortably handle a 10 GB plain text file. For comparison, the whole English Wikipedia, with metadata and page history, is about 14 GB in plain text. Anything larger requires a different strategy, such as cutting the file into chunks and processing them as batches, but such big data processing is beyond the scope of this book.

When Python reads a file, it does so line by line; the lines are accumulated in a list called clean_text_chunks. These are then glued together into a single string, clean_text, which is split into individual words and stored in the list text_as_list. The whole function is create_tesla_text_from_file(textfile="tesla.txt"), where textfile="tesla.txt" says that the argument is a file name, with "tesla.txt" used as the default if none is given. The line text_as_list = create_tesla_text_from_file() calls the function with the default file name and stores its output in text_as_list. At this point the whole text is a single list, each element of which is a word, and repetitions are possible.
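A sketch of such a function, consistent with the description above (the book's exact code may differ in details):

```python
def create_tesla_text_from_file(textfile="tesla.txt"):
    # Read the file line by line, accumulate the lines, join them into
    # one string, and split that string into a list of words.
    clean_text_chunks = []
    with open(textfile, encoding="utf-8") as text:
        for line in text:
            clean_text_chunks.append(line)
    clean_text = "".join(clean_text_chunks)
    return clean_text.split()
```

It would then be called as text_as_list = create_tesla_text_from_file(), relying on the default file name.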

The next step is to handle these repetitions: we build the set of distinct words, count them, and create a dictionary word2index mapping each word to its position, together with the inverse dictionary index2word mapping positions back to words. After that comes the function create_word_indices_for_text(text_as_list), which takes the text as a list of words and produces input/label pairs: each input is a window of context consecutive words, and its label is the word immediately following that window. The function returns the input word and label word lists for further processing.
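The repetition-handling bookkeeping can be sketched as follows; the variable names (distinct_words, word2index, index2word) follow the text, while the tiny example text is ours:

```python
# A toy text with repetitions, standing in for text_as_list.
text_as_list = ["my", "name", "is", "marcus", "my", "name"]

distinct_words = set(text_as_list)          # repetitions removed
number_of_words = len(distinct_words)       # count of unique words
word2index = {w: i for i, w in enumerate(distinct_words)}  # word -> position
index2word = {i: w for i, w in enumerate(distinct_words)}  # position -> word
```

The two dictionaries are exact inverses of each other, which is what lets us move back and forth between words and one-hot indices.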

In other words, this function takes the original text and produces two lists: a list of input word windows and a list of label words. Take the sentence 'why would anyone ever eat anything besides breakfast food?'. We want to build the

‘input’/‘label’ structure for predicting the next word, and we do this by decomposing this sentence into an array:

why would anyone → ever
would anyone ever → eat
anyone ever eat → anything
ever eat anything → besides
eat anything besides → breakfast
anything besides breakfast → food?

Here we take three input words and declare the following word the label, then shift by one word and repeat. The number of input words is governed by the hyperparameter context, which can be changed as desired. The function create_word_indices_for_text(text_as_list) performs exactly this decomposition, returning the list of input word windows and the list of label words. The next part of the code initializes the 'blank' tensors:

input_vectors = np.zeros((len(input_words), context, number_of_words), dtype=np.int16)
vectorized_labels = np.zeros((len(input_words), number_of_words), dtype=np.int16)
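The windowing step just described can be sketched as a small function; this is a reconstruction following the description, with context passed as a parameter rather than read from the hyperparameters:

```python
def create_word_indices_for_text(text_as_list, context=3):
    # Slide a window of `context` input words over the text; the word
    # immediately after the window becomes the label.
    input_words, label_word = [], []
    for i in range(len(text_as_list) - context):
        input_words.append(text_as_list[i:i + context])
        label_word.append(text_as_list[i + context])
    return input_words, label_word

sentence = "why would anyone ever eat anything besides breakfast food?".split()
input_words, label_word = create_word_indices_for_text(sentence)
```

On the breakfast sentence this yields the six input/label pairs listed above.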

This code creates 'blank' tensors filled with zeros. In mathematics, 'matrix' and 'tensor' are distinct objects on which distinct operations are defined, while in computer science both are simply multidimensional arrays. The computer science view is structural: what matters is how we iterate along the dimensions (or

'axes') and which elements sit along them. The type of the entries in the tensors will be int16, but you can change this as you wish.

Let us comment on the tensor dimensions. The tensor input_vectors is a third-order tensor, which you can think of as a 3D array or '3D matrix'. Each input window of three context words is one-hot encoded, and this is what distinguishes it from a bag of words: the one-hot encoding adds a dimension of size number_of_words for each of the context positions. The third dimension, len(input_words) in the code, simply collects all the input windows, much as we collected rows into matrices before. The tensor vectorized_labels needs one dimension fewer, since each label is a single word rather than a window of words. Having initialized the blank tensors, the next bit of code places the 1s in the right spots:

for i, input_w in enumerate(input_words):
    for j, w in enumerate(input_w):
        input_vectors[i, j, word2index[w]] = 1
    vectorized_labels[i, word2index[label_word[i]]] = 1

The hardest part is understanding how this code 'crawls' through the tensors to place the 1s correctly; take your time with it. With the hard part done, we define a simple recurrent neural network using the Keras functions:

model = Sequential()
model.add(SimpleRNN(hidden_neurons, input_shape=(context, number_of_words), unroll=True))
model.add(Dense(number_of_words))
model.add(Activation(output_nonlinearity))
model.compile(loss=error_function, optimizer=my_optimizer)

8.1 Learning Representations

In this chapter we explore unsupervised deep learning, also known as representation learning or learning distributed representations. We start by filling a gap from Chap. 3, where we presented principal component analysis (PCA) as a method for learning distributed representations but framed the problem only as one of optimization.

Recall that to calculate the matrix Q in the equation Z = XQ, we first need the covariance matrix of X. The covariance matrix captures how the entries of the original matrix vary together. For two random variables X and Y, the covariance is defined as COV(X, Y) = E((X − E(X))(Y − E(Y))), which measures how the two change together. Conceptually, any column of data can be viewed as a random variable, which lets us analyse such interdependencies.

Strictly speaking, the equation E(X) = MEAN(X) holds only when the distribution of X is uniform, but it remains useful in practice even when this condition is not met, especially in machine learning, where the optimization process tolerates some imprecision.
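The definition can be checked numerically against NumPy's np.cov. The data below is synthetic; bias=True makes np.cov divide by the plain sample size, matching the mean-based formula:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=200)
Y = 2.0 * X + rng.normal(size=200)   # correlated with X by construction

# COV(X, Y) = E((X - E(X))(Y - E(Y))), with E approximated by the mean.
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))

# np.cov performs the same centring; bias=True matches the plain mean.
cov_matrix = np.cov(X, Y, bias=True)
```

The hand-computed cov_xy agrees with the off-diagonal entry of cov_matrix.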

The attentive reader may observe that E(X) is a vector, whereas MEAN(X) is a single value. To reconcile the two we use a technique known as broadcasting: broadcasting a value v into an n-dimensional vector simply copies v into every component, i.e. broadcast(v, n) = (v, v, …, v) with n components.
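In NumPy, broadcasting happens implicitly whenever a scalar meets a vector, which is exactly what we need to make sense of E(X) = MEAN(X). A quick illustration:

```python
import numpy as np

# broadcast(v, n) = (v, v, ..., v): done explicitly here for illustration.
v, n = 3.5, 4
explicit = np.broadcast_to(v, (n,))   # array of four copies of 3.5

# Implicit broadcasting: the scalar mean is subtracted from every component.
x = np.array([1.0, 2.0, 3.0, 4.0])
centred = x - x.mean()
```

After centring, the components sum to zero, as expected of X − E(X).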

1 The expected value is actually the weighted sum, which can be calculated from a frequency table.

If 3 out of five students got the grade ‘5’, and the other two got a grade ‘3’, E(X) = 0.6 · 5 + 0.4 · 3. © Springer International Publishing AG, part of Springer Nature 2018

S Skansi, Introduction to Deep Learning, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-319-73004-2_8

We will denote the covariance matrix of a matrix X by Σ(X). This notation is a bit unconventional, but it avoids confusion with notations such as Cor, which are used differently in this text. To treat the covariance matrix more rigorously, we start with a column vector

X = (X_1, X_2, …, X_d)^⊤ populated with random variables. The covariance matrix

Σ(X) (whose entries can also be denoted Σ_ij) is defined by Σ_ij = COV(X_i, X_j) = E((X_i − E(X_i))(X_j − E(X_j))), or, written out as the whole d×d matrix, Σ(X) = [COV(X_i, X_j)], i, j = 1, …, d.

The covariance matrix measures the 'self'-covariance of X, i.e. how its components vary together. One important property is that Σ(X) is symmetric, since the covariance of X_i with

X_j is the same as the covariance of X_j with X_i. Σ(X) is also a positive-definite matrix, which means that the scalar v^⊤ Σ(X) v is positive for every non-zero vector v.

We need two more concepts: eigenvectors and eigenvalues. An eigenvector of a square d×d matrix A is a vector that does not change direction when multiplied by A; only its length may change. A d×d matrix has up to d linearly independent eigenvectors (a symmetric matrix has exactly d), and finding them is not trivial. One way to do it is gradient descent, but numerous numerical libraries will do the job for us.

Since an eigenvector v_i only changes its length when multiplied by A, it is standard to normalize it. The factor by which its length changes is the eigenvalue, typically denoted λ_i. This gives the defining property of eigenvectors and eigenvalues: A v_i = λ_i v_i.
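The defining property A v_i = λ_i v_i is easy to verify with such a library, e.g. NumPy. Note that np.linalg.eigh (for symmetric matrices) returns the eigenvalues in ascending order, while the text arranges them in descending order, so a reversal is needed later:

```python
import numpy as np

# A symmetric 2x2 matrix; its eigenvalues are 1 and 3.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lambdas, V = np.linalg.eigh(A)   # eigh: eigendecomposition for symmetric matrices
```

Each column V[:, i] satisfies A v_i = λ_i v_i, and the returned eigenvectors are already normalized to unit length.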

Once we have the vs and λs, we start by arranging the lambdas in descending order: λ_1 > λ_2 > … > λ_d.

This also creates an arrangement in the corresponding eigenvectors v_1, v_2, …, v_d.

Because eigenvalues and eigenvectors are in one-to-one correspondence, sorting the eigenvalues also sorts the eigenvectors. We now build a d×d matrix whose columns are the eigenvectors, placed in the order of their corresponding eigenvalues, renaming the entries so that they

follow the usual matrix entry naming conventions:

V = (v_1 v_2 … v_d)

We now create a blank matrix of zeros (sized d×d) and put the lambdas, in descending order, on the diagonal. We call this matrix Λ:

Λ = diag(λ_1, λ_2, …, λ_d)

With this, we turn to the eigendecomposition of a matrix. If A is a symmetric matrix, its eigendecomposition is:

A = V Λ V^⊤

For a symmetric matrix with linearly independent eigenvectors this decomposition always exists, and it lets us derive the equations we need; they hold for any covariance matrix, since a covariance matrix is symmetric and its eigenvectors can be chosen linearly independent.

Since V is orthonormal,2 we also have V^⊤ V = I. Now we are ready to return to

Z = XQ. Let us take a look at the transformed data Z: its covariance can be expressed as the covariance of X multiplied by Q on both sides:

Σ(Z) = Q^⊤ Σ(X) Q

2 We omit the proof but it can be found in any linear algebra textbook, such as e.g. [1].

We now have to choose a matrix Q so that we get what we want (correlation zero and features ordered according to variance). We simply choose Q := V. Then we have:

Σ(Z) = V^⊤ Σ(X) V = V^⊤ (V Λ V^⊤) V = (V^⊤ V) Λ (V^⊤ V) = Λ

Let us see what we have achieved. All elements of Σ(Z) except the diagonal ones are

zero, so the only remaining covariances are those of each feature with itself, i.e. the variances: COV(X_i, X_i) = λ_i. Since the λ_i are sorted in descending order, the features of Z are ordered by variance, which is exactly what PCA promises. The same reasoning carries over from 2D matrices to tensors. For more details on PCA, see [2].
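The whole derivation can be verified numerically: build a covariance matrix, take Q := V with eigenvectors sorted by descending eigenvalue, and check that the covariance of Z = XQ is diagonal with the λs on the diagonal. The synthetic 2-D data below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
# Correlated 2-D data (think 'height' and 'weight'), centred.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.2], [0.0, 0.7]])
X = X - X.mean(axis=0)

Sigma = np.cov(X, rowvar=False, bias=True)   # covariance matrix of X

lambdas, V = np.linalg.eigh(Sigma)
order = np.argsort(lambdas)[::-1]            # descending eigenvalues
lambdas, V = lambdas[order], V[:, order]

Q = V              # choose Q := V
Z = X @ Q          # the transformed data

Sigma_Z = np.cov(Z, rowvar=False, bias=True)
```

The off-diagonal entries of Sigma_Z vanish (zero correlation) and the diagonal holds the eigenvalues in descending order (features sorted by variance).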

So, we can create new representations of the data in which the features have zero covariance and are sorted by variance. Such a representation is distributed: there is no longer a designated column for, say, 'height', only synthetic ones. The question is what constraints we want the final data to obey. If we would rather not specify the constraints explicitly and instead just provide examples, we need a more general approach. This leads us to autoencoders, which prove remarkably versatile across a variety of tasks.

8.2 Different Autoencoder Architectures

An autoencoder is a three-layer feedforward neural network trained so that its output is the same as its input, which is what makes it unsupervised. This means the output layer must have as many neurons as the input layer; we call this the 'plain vanilla autoencoder'. A problem arises if the hidden layer has as many or more neurons than the input and output layers: the autoencoder can then simply learn the identity function. To prevent this, a simple autoencoder must have fewer neurons in the hidden layer than in the input and output layers. The outputs of the hidden layer then form a distributed representation akin to PCA, which can be fed to a subsequent model such as logistic regression to improve its performance. Alternatively, a sparse autoencoder keeps the hidden layer at most double the size of the input layer but applies a significant dropout rate, so that few hidden neurons are active during training.

Different autoencoder architectures produce different representations of the input. Simple autoencoders create compact distributed representations, which make the information easier for a subsequent neural network to digest and tend to improve accuracy. Sparse autoencoders produce larger hidden-layer vectors in which redundancies are spread out, a more 'diluted' representation that is also easier to process. Instead of dropout, sparsity can be enforced with a sparsity rate, which sets all activations below a given threshold to zero.
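The sparsity rate amounts to a simple thresholding operation; the activation values and the threshold below are made up for illustration:

```python
import numpy as np

# Hypothetical hidden-layer activations of a sparse autoencoder.
activations = np.array([0.91, 0.03, 0.40, 0.02, 0.75, 0.07])

sparsity_threshold = 0.1   # an illustrative threshold, not a value from the book
sparse = np.where(activations < sparsity_threshold, 0.0, activations)
```

Everything below the threshold is zeroed out, leaving only the strongest activations in the representation.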

Denoising autoencoders add noise to the input, for example by adding random values to, say, 10% of the input entries, while the target remains the original, noise-free input. If instead we add an explicit regularization term, we obtain contractive autoencoders. There are many other, more complex autoencoder variants, but they are beyond the scope of this book; the interested reader is directed to the literature for more information.

Autoencoders are mainly used to preprocess data for a feedforward neural network, and the output we actually use for this purpose is the activation of the hidden layer, which does most of the transformation work.

A latent variable is an underlying variable that correlates with one or more observable variables. In Chap. 3 we informally discussed PCA through latent variables such as 'height' and 'weight'. When we posit a latent variable, we assume there is a probability distribution that defines it. Whether we discover such variables or define them is a philosophical debate; our practical goal is for our defined latent variables to come as close as possible to those found in nature. A distributed representation serves precisely as such a probability distribution.

All the autoencoders we have seen (plain vanilla, simple, sparse, denoising, contractive) try to learn these objective latent variables by measuring the similarity between probability distributions; learning stops when the distributions are close enough. This closeness is typically assessed with the Kullback–Leibler divergence.

The Kullback–Leibler divergence of two probability distributions P and Q is denoted D_KL(P, Q); note that it is not symmetric, i.e. in general D_KL(P, Q) ≠ D_KL(Q, P). For further details, see [3]. Autoencoders were first proposed by Dana H. Ballard in 1987 and were also independently considered by Yann LeCun. A comprehensive overview of the various autoencoder types and how they work, including stacked denoising autoencoders, is given in [6]; we turn to stacking in the following section.
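Before moving on to stacking, the Kullback–Leibler divergence and its asymmetry can be checked with a few lines of NumPy, using the standard definition D_KL(P, Q) = Σ_i P_i log(P_i / Q_i), which the text does not spell out:

```python
import numpy as np

def kl_divergence(P, Q):
    # D_KL(P || Q) = sum_i P_i * log(P_i / Q_i); assumes strictly positive
    # distributions of the same length.
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum(P * np.log(P / Q)))

P = np.array([0.6, 0.2, 0.2])
Q = np.array([0.4, 0.4, 0.2])
```

The divergence of a distribution from itself is zero, it is positive otherwise, and swapping the arguments changes the value, confirming the asymmetry.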

8.3 Stacking Autoencoders

Autoencoders are like LEGO bricks: they can be stacked together to form stacked autoencoders. Remember that what matters is not the output layer but the activations of the middle layer, which serve as input to a conventional neural network. To stack autoencoders properly we must therefore join them at their middle layers, not simply place one after another. Suppose we have two simple autoencoders, one of shape (13, 7, 13) and another of shape (7, 4, 7).

Fig 8.2 Stacking a (4, 3, 4) and a (4, 2, 4) autoencoder resulting in a (4, 3, 2, 3, 4) stacked autoencoder

Both process the same data, so the input and output sizes must match, while the middle layers may differ. Joined at their middle layers, the two form a (13, 7, 4, 7, 13) stacked autoencoder, with a natural bottleneck in the middle. The main result of this stacked autoencoder is the distributed representation produced by its middle layer. We will implement denoising autoencoders in the spirit of [6], adapting the code available at https://blog.keras.io/building-autoencoders-in-keras.html. The first part of the code consists of the imports:

from keras.layers import Input, Dense
from keras.models import Model
from keras.datasets import mnist
import numpy as np

(x_train, _), (x_test, _) = mnist.load_data()

The final line of code loads the MNIST dataset, for which Keras provides a convenient built-in function. It returns two pairs of Numpy arrays: training samples and labels (60,000 rows) and test samples and labels (10,000 rows). Since we do not need the labels, we store them in the anonymous variable _, which merely satisfies the function's requirement of returning two values per pair. The next part of the code preprocesses the MNIST data:

x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
noise_rate = 0.05

This snippet rescales the original pixel values from the 0–255 range to the 0–1 range, using the Numpy float32 type, and defines the noise_rate parameter, which we will need shortly. Next, Gaussian noise drawn from a normal distribution with mean 0.0 and standard deviation 1.0 is added to x_train and x_test, and the result is clipped back to valid pixel values with np.clip:

x_train_noisy = x_train + noise_rate * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_test_noisy = x_test + noise_rate * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)
x_test_noisy = np.clip(x_test_noisy, 0.0, 1.0)

This part of the code introduces the noise into a copy of the data. Note that np.random.normal(loc=0.0, scale=1.0, size=x_train.shape) creates a new array of the same shape as x_train, filled with Gaussian random values with mean (loc) 0.0 and standard deviation (scale) 1.0. This array is scaled by the noise rate and added to the original data; the two np.clip calls then make sure the result stays within the 0–1 range after the addition. Next, we reshape our arrays from (60000, 28, 28) and (10000, 28, 28) into (60000, 784) and (10000, 784) respectively, as we have done before with the MNIST dataset:

x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
x_train_noisy = x_train_noisy.reshape((len(x_train_noisy), np.prod(x_train_noisy.shape[1:])))
x_test_noisy = x_test_noisy.reshape((len(x_test_noisy), np.prod(x_test_noisy.shape[1:])))
assert x_train_noisy.shape[1] == x_test_noisy.shape[1]

The first four rows reshape the arrays, while the final row asserts that the noisy training and test vectors have the same number of features, which the autoencoder requires. A size mismatch would crash the program at this point, but this deliberate early failure aids debugging by pinpointing the location of the error. With preprocessing complete, we proceed to construct the autoencoder, starting with the input layer, followed by three encoding layers of decreasing size and three decoding layers that mirror the encoding structure to reconstruct the original input.

Note that here we build the model by manually connecting layers of sizes 128, 64, 32, 64 and 128, and you are free to experiment with different activation functions. It is crucial that both the input and output sizes match the shape of x_train_noisy. After the layers are defined, the model is assembled; various optimizers and error functions can be used. We compile the autoencoder with the 'sgd' optimizer and the 'mean_squared_error' loss, and train it on the noisy data for five epochs with a given batch size.

Once the code runs correctly, consider increasing the number of epochs. In the final part, we evaluate the model, make predictions, and extract the weights of the deepest middle layer. When printing the shapes of all weight matrices with get_weights(), the first matrix at which the dimensions begin to increase again (in this case, from (32, 64)) marks the middle of the stacked autoencoder; this is how we identify the weight matrix holding the deeply encoded MNIST data. We use the evaluate method to measure the model's performance on the noisy test data and print the result, and finally save all weights to a file named all_AE_weights.h5 for future use.

The weight matrix stored in the variable deeply_encoded_MNIST_weight_matrix contains the trained weights of the central layer of the stacked autoencoder; together with the corresponding labels, it is meant to be fed to a fully connected neural network. This matrix is a distributed representation of the original dataset. A backup of all weights is kept in the H5 file for future reference. We have also introduced a variable named results for making predictions with the autoencoder; this serves mainly to assess the autoencoder's quality, since an autoencoder is rarely used for predictions in the practical sense.

Recreating the Cat Paper

In this section, we recreate the idea presented in the famous 'cat paper', with the official title Building High-level Features Using Large Scale Unsupervised Learning [7].

This paper described a neural network that learned to recognize cats from frames taken from 10 million YouTube videos. Instead of labelled data, the authors used an unlabelled dataset and let the network learn by reconstruction, as an autoencoder; the trained network was then tested against images from ImageNet. The network's activations formed distinct patterns when shown cats, so labels were generated implicitly. The authors also showed that the network could produce a representation of a cat's face by optimizing the output of the best 'cat finder' neuron: by combining the top images recognized by this neuron, they obtained a drawing of a cat face that was genuinely new, demonstrating that the network can not only recognize images but generate new ones from what it has learned.

The architecture used was substantial: training ran on 16,000 computer cores for three days. The autoencoder had over 1 billion trainable parameters, which is still only a small fraction of the number of synapses in the human visual cortex. Training used input images given as 200 by 200 by 3 tensors, while testing used 32 by 32 by 3 tensors. A receptive field of 18 by 18 was employed, akin to convolutional networks, but with untied weights: each 'tile' of the field has its own weights. There were 8 feature maps, followed by an L2 pooling layer, which, unlike traditional max-pooling, squares the inputs of each region it processes, sums them, and outputs the square root of that sum.
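L2 pooling as described above can be sketched in a few lines of plain Python (a hypothetical helper of ours, not the paper's code): for each pooled region, square the inputs, sum them, and output the square root.

```python
import math

def l2_pool(region):
    """L2 pooling over one region: sqrt of the sum of squared inputs."""
    return math.sqrt(sum(x * x for x in region))

# Compare with max-pooling on the same flattened 2x2 region:
region = [3.0, 4.0, 0.0, 0.0]
print(l2_pool(region))   # 5.0, since sqrt(9 + 16) = 5
print(max(region))       # 4.0, what max-pooling would output
```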

The overall autoencoder consists of three identical parts. Each part passes its input through a receptive field (without shared weights), followed by L2 pooling and local contrast normalization; after the first part, the second and third repeat the same architecture. The whole network is trained with asynchronous stochastic gradient descent (SGD): multiple SGD instances run simultaneously on different parts of the data, all accessing a central weights repository. At the start of each training phase, every SGD instance requests the current weights from the repository, performs its optimization, and sends the results back.

The results are sent back to the repository, where they become available to the other instances running asynchronous SGD. The minibatch size used was 100. For further details, the reader is encouraged to consult the original paper [7].

1 S Axler, Linear Algebra Done Right (Springer, New York, 2015)

2 R Vidal, Y Ma, S Sastry, Generalized Principal Component Analysis (Springer, London, 2016)

3 I Goodfellow, Y Bengio, A Courville, Deep Learning (MIT Press, Cambridge, 2016)

4 D.H Ballard, Modular learning in neural networks, in AAAI-87 Proceedings (AAAI, 1987), pp. 279–284

5 Y LeCun, Modeles connexionnistes de l'apprentissage (Connectionist Learning Models) (Université P. et M. Curie (Paris 6), 1987)

6 P Vincent, H Larochelle, I Lajoie, Y Bengio, P.-A Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11, 3371–3408 (2010)

7 Q.V Le, M.A Ranzato, R Monga, M Devin, K Chen, G.S Corrado, J Dean, A.Y Ng, Building high-level features using large scale unsupervised learning, in Proceedings of the 29th International Conference on Machine Learning ICML (2012)

Word Embeddings and Word Analogies

Neural language models use distributed representations to turn words and sentences into vectors of numbers, called word embeddings. Because these representations are learned, neural language models are an effective way of producing word embeddings.

A word embedding is simply a numerical representation of a word; the phrase 'Nowhere fast', for example, might be represented as (1, 0, 0, 5.678, −1.6, 1). This chapter focuses on the Word2vec model, one of the best-known neural language models, which uses a simple neural network to learn vectors that represent words well.

We have already seen the predict-next setting for recurrent neural networks; what we want now is a notion of distance between words such that similar words end up close together. Traditionally, the Hamming distance measures the difference between two strings of the same length as the number of positions at which their characters differ. For example, the Hamming distance between 'topos' and 'topoi' is 1, between 'fellows' and 'friends' it is 5, and between 'friends' and '0r$8MMs' it is also 5. This distance can be normalized to a percentage by dividing it by the length of the words involved. Although this offers a first glimpse of language processing, its usefulness is rather limited.
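The Hamming distances above can be checked with a short function (our own sketch):

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("topos", "topoi"))      # 1
print(hamming("fellows", "friends"))  # 5
print(hamming("friends", "0r$8MMs"))  # 5
print(hamming("fellows", "friends") / len("friends"))  # normalized to a fraction
```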

The Hamming distance is the simplest of a family of string similarity measures known as string edit distances. More advanced measures, such as the Levenshtein distance and the Jaro–Winkler distance, can compare strings of different lengths and assign different penalties to insertions, deletions and modifications. However, all of these measure only the form of the words, which makes them useless for comparing semantically similar but formally different words such as 'professor' and 'teacher'. This is why we want to embed words as vectors that capture their meaning.

S Skansi, Introduction to Deep Learning, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-319-73004-2_9

The aim is to represent each word as a vector in a way which will convey information about the meaning of the word (i.e. its use in our language).

If words are to be vectors, we need a distance measure between vectors. This brings us to cosine similarity, which quantifies how similar two vectors are (for a detailed overview, see [5]). The cosine similarity of two n-dimensional vectors v and u is defined as

cos(v, u) = (Σ_i v_i u_i) / (||v|| · ||u||)

where v_i and u_i are the components of v and u, and ||v|| and ||u|| are the norms of the respective vectors. Cosine similarity measures the correlation between the two vectors, ranging from 1 (the vectors point the same way) through 0 (no correlation) to −1 (the vectors are opposite). For bag-of-words models, one-hot encodings and similar word representations, cosine similarity ranges from 0 to 1, since such vectors have no negative components; in this setting, a similarity of 0 plays the role of 'opposite'.
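The definition translates directly into code; the following sketch is ours, not the book's:

```python
import math

def cosine_similarity(v, u):
    """cos(v, u) = (sum_i v_i * u_i) / (||v|| * ||u||)."""
    dot = sum(vi * ui for vi, ui in zip(v, u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    return dot / (norm_v * norm_u)

print(cosine_similarity([1, 2, 3], [1, 2, 3]))  # ~1.0: same direction
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: no correlation
print(cosine_similarity([1, 2], [-1, -2]))      # ~-1.0: opposite direction
```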

In this section, we will explore the Word2vec neural language model, focusing on its input requirements, output capabilities, tunable parameters, and its integration within a larger system.

CBOW and Word2vec

The Word2vec model can be built with two different architectures, the skip-gram and the continuous bag of words (CBOW). Both of these are actually shallow neural networks with a twist.

To illustrate the difference, consider the sentence 'Who are you, that you do not know your history?'. We begin by removing capitalization and punctuation. Both architectures rely on the context of words: the words surrounding a given word. The size of the context must be fixed in advance; for simplicity, we use a context of one, so each word's context consists of the word before it and the word after it. We can then decompose the sentence into (word, context) pairs.

These are the two models learned by Word2vec: the skip-gram and the continuous bag of words (CBOW). The skip-gram model predicts the context words from the main word: given 'know', it should predict 'not' and 'your'. The CBOW model does the opposite, predicting the main word from the surrounding context words: it takes the two context words (c1 and c2) and predicts the middle word m.

1 If the context were 2, it would take 4 words, two before the main word and two after.

The network that creates word embeddings resembles an autoencoder in structure: it is a shallow feedforward network. Its input layer accepts one-hot word index vectors, so it needs as many input neurons as there are distinct words in the vocabulary. The number of neurons in the hidden layer is called the embedding size, and typically ranges from 100 upwards.

For a modest dataset, the vocabulary is often limited to around 1000 words, and the number of output neurons equals the number of input neurons. The connections from the input to the hidden layer are linear, with no activation functions, while the hidden-to-output connections use softmax activations. The deliverable of the model is the weight matrix of the input-to-hidden connections: it holds the individual word vectors. To extract the vector of a given word, one multiplies this matrix by the one-hot index vector of that word; the weights themselves are trained with backpropagation. One historical clarification: the idea behind Word2vec, that a word's meaning is constituted by its use in the language, is often attributed to Harris's 1954 paper, but it was articulated earlier, in Wittgenstein's 1953 Philosophical Investigations, a notable example of the influence of ordinary language philosophy on natural language processing.
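To see why multiplying the weight matrix by a one-hot index vector extracts exactly one word's vector, consider a toy example (illustrative only; the numbers and names are ours):

```python
# A toy embedding matrix: 4 words in the vocabulary, embedding size 3.
# Row i holds the word vector of the word with index i.
W = [
    [0.1, 0.2, 0.3],   # vector for word 0
    [0.4, 0.5, 0.6],   # vector for word 1
    [0.7, 0.8, 0.9],   # vector for word 2
    [1.0, 1.1, 1.2],   # vector for word 3
]

def word_vector(W, one_hot):
    """Multiply the one-hot row vector by W: this selects exactly one row."""
    return [sum(one_hot[i] * W[i][j] for i in range(len(W)))
            for j in range(len(W[0]))]

one_hot = [0, 0, 1, 0]           # index vector for word 2
print(word_vector(W, one_hot))   # [0.7, 0.8, 0.9], i.e. W's third row
```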

Fig 9.1 CBOW Word2vec architecture

Word2vec in Code

We now present a CBOW Word2vec implementation, given in two connected blocks of code which should be placed in a single Python file. We begin by importing the necessary libraries and defining the hyperparameters: the embedding size is set to 300, the context window to 2, and the text is represented as a list of words:

text_as_list = ["who", "are", "you", "that", "you", "do", "not", "know", "your", "history"]
embedding_size = 300
context = 2

The variable text_as_list can hold any text: you can put your own content here, or reuse the code from the recurrent neural network chapter to turn a text file into a list of words. The embedding size is the size of the hidden layer, and hence the dimensionality of the final word vectors. The context is the number of words around a target word to be used: a context of 2 means two words before and two words after the target. The next block of code, similar to the one we used for recurrent neural networks, collects the distinct words, counts them, and builds mappings from words to indices and vice versa.

This code creates two dictionaries, one mapping words to their indices and one mapping indices back to words. It then defines a function that produces two lists: one of main words and one of the context words belonging to each main word. The function walks through the input text and collects context words by their position relative to the main word, taking care of the edge cases where the main word lies at the beginning or end of the text. Finally, the input vectors and vectorized labels are initialized as zero matrices and filled in according to which context words accompany which main word, preparing the data for training.

To recap, this block defines a function that, given a list of words, returns a copy of the list together with a list of lists holding the context words of each word. The function is applied to text_as_list, after which two zero-initialized matrices are created to hold the one-hot representations, and the appropriate entries relating each target word to its context are set to 1. The Keras model is then initialized and trained on these matrices.
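The context-gathering step can be sketched in plain Python as follows (a simplified version of the idea, with hypothetical names, not the book's exact code):

```python
def words_and_contexts(text, context=2):
    """Return (words, contexts): for each word in the text, the list of up to
    `context` words before it and `context` words after it."""
    words = list(text)
    contexts = []
    for i in range(len(words)):
        left = words[max(0, i - context):i]      # handles the start edge case
        right = words[i + 1:i + 1 + context]     # handles the end edge case
        contexts.append(left + right)
    return words, contexts

text = ["who", "are", "you", "that", "you", "do", "not", "know", "your", "history"]
w, c = words_and_contexts(text, context=2)
print(c[0])  # ['are', 'you']: 'who' has no words before it
print(c[4])  # ['you', 'that', 'do', 'not']: full context of the middle 'you'
```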

word2vec = Sequential()
word2vec.add(Dense(embedding_size, input_shape=(number_of_words,), activation="linear"))
word2vec.add(Dense(number_of_words, activation="softmax"))

The model is built from a linear layer followed by a softmax layer. It is then compiled with mean squared error as the loss function and stochastic gradient descent ('sgd') as the optimizer, and trained by calling word2vec.fit on input_vectors and vectorized_labels for a number of epochs with a given batch size. After training, the model's performance is evaluated and the accuracy printed as a percentage.

The model follows closely the architecture we presented in the last section.

The model is trained for 1500 epochs, and you should feel free to experiment with different settings. To turn this into a skip-gram model, simply swap the two matrices in the fitting call, i.e. in the line where word2vec.fit is applied to input_vectors and vectorized_labels.

For a skip-gram model, the fitting line thus becomes word2vec.fit(vectorized_labels, input_vectors, ...), with the remaining arguments unchanged. After training, we save the weights and extract the embedding matrix:

word2vec.save_weights("all_weights.h5")
embedding_weight_matrix = word2vec.get_weights()[0]

The last line extracts the word vectors as a number_of_words × embedding_size array, from which we can select the row corresponding to any given word. The line before it saves all the network weights to an H5 file, which is essential for many uses of word2vec: we can train weights from scratch, as our code does, or fine-tune pre-existing word embeddings (for example, embeddings trained on Wikipedia) by loading saved weights into a model and continuing training on more specific texts, such as legal documents. Word vectors can also replace one-hot encoded words or bag-of-words representations as inputs to other neural networks, for tasks such as sentiment prediction.

The H5 file contains all the network weights, but we only need the weight matrix of the first layer. This is the matrix retrieved by the last line of code, named embedding_weight_matrix. We will use it in the next block of code, which should be placed in the same file as the code so far.

Alternatively, we could load the weights from the H5 file into a new network with the same configuration, fine-tune them there, and extract the weight matrix with the same line of code as before.

9.4 Walking Through the Word-Space: An Idea That Has Eluded Symbolic AI

Word vectors allow something that goes well beyond what word embeddings were traditionally expected to do. Historically, reasoning has been conceived as symbolic: relations connecting various objects, with the symbols themselves logically primitive and devoid of intrinsic meaning. This view dominated the logical approach to artificial intelligence (GOFAI) for decades; rationality was equated with intelligence, and the higher cognitive faculties were taken to embody it. But Hans Moravec observed that tasks like chess playing and theorem proving are actually easier for machines than recognizing cats in unlabelled images, which forced the AI community to reconsider its notion of intelligence and to take 'lower faculty' reasoning seriously.

'Lower faculty' reasoning includes the ability to evaluate statements by their degree of wrongness. Both 'a tomato is a vegetable' and 'a tomato is a suspension bridge' are, strictly speaking, false, yet most people would agree that the first is less wrong than the second: there is a spectrum between truth and falsity that context helps us navigate.

Note that statements like 'a tomato is a vegetable' and 'a tomato is a suspension bridge' are linguistic classifications: they operate through social conventions of language use rather than natural phenomena. They speak of classes defined by shared properties rather than of the objects themselves, which highlights the importance of descriptions in language. The singular terms in these sentences carry the relationship between the concepts, while the phrase 'is a' serves merely as a connector, with no meaning of its own.

Imagine an agent confined to a room containing only books in a language she does not speak. She could still demonstrate intelligence by identifying patterns, for instance by distinguishing words that denote places from words that denote people. If she notices the similarity between the sentences 'Luca frequenta la scuola elementare Pedagna' and 'Marco frequenta la scuola elementare Zolino', she might conclude that 'Luca' relates to 'Pedagna' in the same way that 'Marco' relates to 'Zolino'. Then, upon encountering the new sentence 'Luca vive in Pedagna', she could conjecture 'Marco vive in Zolino', reasoning by analogy over semantically similar terms.

Using Word2vec, we can capture exactly this kind of similarity between the terms in our datasets and reason with them. To see how, we continue with the code from the previous section, in the same Python file. Using embedding_weight_matrix, we obtain a new way of measuring word similarity, via word-vector clusterings, and we can calculate and reason with words through their vector representations.
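The analogy 'Luca is to Pedagna as Marco is to ?' can be answered with word-vector arithmetic: v(Pedagna) − v(Luca) + v(Marco) should land near v(Zolino). Here is a toy illustration of the mechanics with hand-made vectors (purely hypothetical; real embeddings would come from a trained matrix such as embedding_weight_matrix):

```python
import math

def cos(v, u):
    dot = sum(a * b for a, b in zip(v, u))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in u)))

# Hand-made toy vectors: component 1 ~ "person", component 2 ~ "place", 3 ~ arbitrary.
vectors = {
    "luca":    [1.0, 0.0, 1.0],
    "marco":   [1.0, 0.0, -1.0],
    "pedagna": [0.0, 1.0, 1.0],
    "zolino":  [0.0, 1.0, -1.0],
    "tomato":  [0.9, 0.1, 0.5],   # a distractor word
}

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by nearest cosine to v(b) - v(a) + v(c)."""
    target = [x - y + z for x, y, z in zip(vectors[b], vectors[a], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vectors[w], target))

print(analogy("luca", "pedagna", "marco"))  # zolino
```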

Energy-Based Models

We turn now to energy-based models, a special class of neural networks. The first of them, the Hopfield network, originated in the 1980s. Despite their simplicity, Hopfield networks differ considerably from the models we have seen so far. A Hopfield network is composed of interconnected neurons: each pair of neurons i and j is joined by a weight w_ij, and each neuron i has a threshold b_i. Neurons take values of either −1 or 1; in image-processing terms, white and black, with no shades of grey. The inputs to the neurons are denoted x_i. A simple Hopfield network is shown in Fig 10.1a.

Once a network is assembled, the training can start. The weights are set by the following rule, where n ranges over the N training samples:

w_ij = Σ_{n=1}^{N} x_i^(n) x_j^(n)    (10.1)

Then we compute the activation for each neuron:

y_i = Σ_j w_ij x_j    (10.2)

Weights can be updated synchronously, all at once, or asynchronously, one at a time, which is the standard approach. Hopfield networks have no recurrent connections, so the weight of a neuron to itself is zero (w_ii = 0), and all connections are symmetric: w_ij = w_ji. As an example, consider a simple Hopfield network memorizing 3-pixel images, starting with the vectors a = (−1, 1, −1) and b = (1, 1, −1).


Fig 10.1 Hopfield networks

The third vector is c = (−1, −1, 1). Using Eq. 10.1, we calculate the weights:

w_11 = w_22 = w_33 = 0
w_12 = a_1 a_2 + b_1 b_2 + c_1 c_2 = (−1)·1 + 1·1 + (−1)·(−1) = 1
w_13 = a_1 a_3 + b_1 b_3 + c_1 c_3 = (−1)·(−1) + 1·(−1) + (−1)·1 = −1
w_23 = a_2 a_3 + b_2 b_3 + c_2 c_3 = 1·(−1) + 1·(−1) + (−1)·1 = −3
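The weight computation above can be verified in a few lines (our own check, not the book's code):

```python
# The three training samples (3-pixel images).
samples = [(-1, 1, -1), (1, 1, -1), (-1, -1, 1)]

def hopfield_weight(i, j):
    """Eq. 10.1: w_ij = sum over samples of x_i * x_j (and w_ii = 0)."""
    if i == j:
        return 0
    return sum(x[i] * x[j] for x in samples)

print(hopfield_weight(0, 1))  # w_12 = 1
print(hopfield_weight(0, 2))  # w_13 = -1
print(hopfield_weight(1, 2))  # w_23 = -3
print(hopfield_weight(1, 0) == hopfield_weight(0, 1))  # symmetric: True
```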

Hopfield networks use a global measure of success called energy, which plays a role analogous to the error function of ordinary neural networks: it is a single number describing the overall state of the network at each stage of training. The standard form of the energy is E = −(1/2) Σ_{i,j} w_ij x_i x_j + Σ_i b_i x_i.

As learning advances, the energy E either stays the same or decreases, and the network settles into local minima; these minima are the memories of the training samples. Hopfield networks can also compute logical functions: conjunction and disjunction require three neurons, while four neurons are needed for XOR.

The next model we briefly present is the Boltzmann machine.

Boltzmann machines, introduced in 1985, are similar to Hopfield networks, but have both input (visible) and hidden neurons, connected by non-recurrent, symmetric weights. A typical Boltzmann machine is shown in Fig 10.2a. The hidden units are initialized randomly and made to build a hidden representation that mimics the inputs. This yields two probability distributions, which can be compared with the Kullback–Leibler divergence (KL). The goal is then to compute the gradient ∂KL/∂w and use backpropagation to learn the weights.

Fig 10.2 Boltzmann machines and restricted Boltzmann machines

Restricted Boltzmann machines (RBMs) are a subclass of Boltzmann machines with no connections between neurons of the same layer, which allows a modified backpropagation similar to the one used in feedforward networks. An RBM has two layers: a visible layer, serving for both input and output, and a hidden layer. In the forward pass, the hidden activations are computed as y = σ(x⊤w + b^[h]). RBMs differ from autoencoders in having a reconstruction phase: the outputs are fed back to the hidden layer and from there returned to the visible layer, and the reconstruction error, measured by the Kullback–Leibler divergence, is what gets backpropagated. A small reconstruction error indicates that the RBM has learned well. Deep belief networks (DBNs) are stacks of RBMs; they can be trained as generative models or as classifiers, using backpropagation or contrastive divergence, the latter being an efficient algorithm for approximating log-likelihood gradients. For further reading on contrastive divergence and on the cognitive aspects of energy-based models, see the suggested literature.
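The forward pass y = σ(x⊤w + b^[h]) can be sketched in plain Python (a toy illustration with made-up numbers, not a full RBM implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rbm_forward(x, W, b_hidden):
    """Hidden activations of an RBM: y_j = sigmoid(sum_i x_i * W[i][j] + b_j)."""
    n_hidden = len(b_hidden)
    return [sigmoid(sum(x[i] * W[i][j] for i in range(len(x))) + b_hidden[j])
            for j in range(n_hidden)]

# Toy RBM: 3 visible units, 2 hidden units.
x = [1.0, 0.0, 1.0]
W = [[0.5, -0.2],
     [0.3, 0.8],
     [-0.5, 0.2]]
b_hidden = [0.0, 0.1]

print(rbm_forward(x, W, b_hidden))  # two values, each strictly between 0 and 1
```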

Memory-Based Models

Neural Turing machines (NTMs) are a memory-based model first introduced in [9]. Like the classical Turing machine, which has a read-write head and a tape serving as memory, the NTM aims to execute algorithms: to process inputs into outputs. The key difference is that every component of an NTM is trainable: all its computations are soft, and it learns to perform its task better.

The neural Turing machine behaves like an LSTM in that it processes input sequences into output sequences. If we want a single result, we simply take the final component of the output and discard the rest. The NTM architecture is an extension of the LSTM, much as the LSTM is an extension of simple recurrent networks.

A neural Turing machine consists of several components, chief among them the controller and the memory. The controller is an LSTM; at each time step t it takes as input the current raw input x_t together with the result of the previous step, r_{t−1}. The memory is a tensor M_t, typically a matrix. Memory is not an input to the controller in the ordinary sense, but it plays a crucial role in the machine's operation: at each step, the machine works with the previous memory state M_{t−1}.

The complete structure of a neural Turing machine is depicted in Fig 10.3. Everything is represented with tensors and trained by gradient descent, so the crisp Turing-machine concepts are 'fuzzified': instead of accessing a single, isolated memory location, the NTM accesses all locations simultaneously, each to some degree. That degree of access is itself adjustable, so the amount of memory effectively used can change dynamically.

The LSTM controller combines the outputs of the previous step with the new input vector and with the memory matrix to produce the outputs, and all of this is trainable. To understand how the components work, we focus on three vectors produced by the controller: the add vector a_t, the erase vector e_t and the weighting vector w_t. They are produced in similar ways, but each serves a distinct purpose.

We will be coming back to them later to explain how they are produced.

Let us see how the memory works. The memory is represented by a matrix (or possibly a higher-order tensor) M_t. Each row in this matrix is called a memory location. If there are n rows in the memory, the controller produces a weighting vector of size n.

1 For a fully detailed view, see the blog entry of one of the creators of the NTM, https://medium.com/aidangomez/the-neural-turing-machine-79f6e806

The components of w_t range from 0 to 1, each expressing the degree to which the corresponding location is addressed; access can be crisp or fuzzy, and since the vector is trained, it is usually not crisp. The read operation is defined as the Hadamard (pointwise) product of the m-by-n memory matrix M_t with a matrix B, where B is obtained by transposing the weighting vector w_t and broadcasting its values to match the dimensions of M_t.
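A fuzzy read can be sketched as a weighted sum of memory rows (an illustrative pure-Python sketch of ours, not the NTM paper's code): with a crisp one-hot weighting it reduces to reading a single location, while a soft weighting blends locations.

```python
def ntm_read(M, w):
    """Read vector r: the sum of memory rows M[i] weighted by w[i]."""
    cols = len(M[0])
    return [sum(w[i] * M[i][j] for i in range(len(M))) for j in range(cols)]

M = [[1.0, 0.0],    # memory location 0
     [0.0, 1.0],    # memory location 1
     [2.0, 2.0]]    # memory location 2

print(ntm_read(M, [1.0, 0.0, 0.0]))  # crisp: exactly location 0 -> [1.0, 0.0]
print(ntm_read(M, [0.5, 0.5, 0.0]))  # fuzzy: a blend -> [0.5, 0.5]
```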

The NTM writes at every step. It may often simply rewrite the existing values, which can create the impression that it has decided to leave the memory unchanged. It is important to see that there is no such decision: the NTM always writes, and sometimes what it writes happens to be the same as what was already there.

The write operation is decomposed into two parts: erase and add. The erase operation resets a memory entry to zero exactly to the degree given by the weighting vector w_t and the erase vector e_t (an entry is fully erased only where both are 1):

M̂_t = M_{t−1} ∘ (1 − w_t e_t)

The add operation then writes into the erased memory:

M_t = M̂_t + w_t a_t

Both operations work in the same way, relying on trainable components, with no intrinsic difference between them. What connects them is addressing: the rather complex procedure by which the weighting vectors w_t are produced. Notably, neural Turing machines use both location-based and content-based addressing.
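The erase and add steps can be sketched elementwise (a pure-Python illustration of ours, with made-up numbers):

```python
def ntm_write(M_prev, w, e, a):
    """Erase then add, entry by entry:
    M_hat[i][j] = M_prev[i][j] * (1 - w[i] * e[j])
    M_new[i][j] = M_hat[i][j] + w[i] * a[j]"""
    M_hat = [[M_prev[i][j] * (1 - w[i] * e[j]) for j in range(len(M_prev[0]))]
             for i in range(len(M_prev))]
    return [[M_hat[i][j] + w[i] * a[j] for j in range(len(M_hat[0]))]
            for i in range(len(M_hat))]

M = [[1.0, 1.0],
     [2.0, 2.0]]
w = [1.0, 0.0]       # address only location 0, crisply
e = [1.0, 1.0]       # erase everything at the addressed location
a = [5.0, 6.0]       # then write these values there

print(ntm_write(M, w, e, a))  # [[5.0, 6.0], [2.0, 2.0]]: row 0 replaced, row 1 untouched
```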

Memory networks (MemNN) are a simpler but still powerful memory-based model, extending the capacity for long-term dependencies beyond what an LSTM can retain. They consist of several components, all of which, except the memory itself, are neural networks; this makes memory networks closer to the spirit of connectionism than neural Turing machines, while remaining just as capable. The components are:

• Memory (M): An array of vectors

• Input feature map (I): converts the input into a distributed representation

• Updater (G): decides how to update the memory given the distributed representation passed in by I

• Output feature map (O): receives the input distributed representation and finds supportive vectors from memory, and produces an output vector

• Responder (R): additionally formats the output vectors given by O

The connections among the components are depicted in Fig 10.4. All components except the memory are neural networks, and therefore trainable. In a minimal model, I could be word2vec, G could simply store each representation in the next available memory slot, and R could format the output by replacing indices with words and adding filler words. O does the critical work: it locates a number of supporting memories (each retrieval is called a hop) and then bundles these memories with the representation passed in by I.

Bundling itself is a straightforward matrix multiplication of the input and memory, augmented with additional learned weights. This is typical of connectionist models: the process relies on basic operations such as addition and multiplication, and it is the learned weights that drive performance. A fully trainable, more complex memory network is introduced in [11].
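One hop with this kind of bundling can be sketched as follows. This is a bare-bones illustration: the learned embedding matrices that the real model applies to the query and the memories are omitted, so only the matching-and-bundling arithmetic remains:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def one_hop(query, memory):
    """One memory-network 'hop': score each memory against the query
    (content-based matching), then bundle by a weighted sum."""
    scores = softmax(memory @ query)   # (N,) match strengths
    return scores @ memory             # (D,) bundled output vector

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
query = np.array([1.0, 0.0])
out = one_hop(query, memory)   # leans toward the memories matching the query
```

Memories that match the query receive higher softmax weight, so the bundled output is dominated by the supporting vectors; stacking several such hops lets later hops condition on earlier retrievals.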

Neural Turing machines and memory networks share the limitation of relying on segmented, vector-based memory. It would be interesting to explore a memory-based model with continuous memory, perhaps using float-encoded vectors. Note, however, that even basic memory networks have significantly more trainable parameters than LSTMs, and therefore take longer to train. A key challenge in memory models is reusing parameters across components, which can make learning more efficient. Note also that memory addressing in memory networks is purely content-based.

2 By default, memory networks make one hop, but it has been shown that multiple hops are beneficial, especially in natural language processing.


10.3 The Kernel of General Connectionist Intelligence: The bAbI Dataset

Neural networks have grown into a prominent subfield of artificial intelligence, with deep learning as its leading approach. This raises the question of how to evaluate neural networks as AI systems, which brings the classic Turing test back into consideration. Fortunately, researchers have the bAbI dataset, a collection of toy tasks designed for exactly this purpose.

The bAbI dataset serves as a benchmark for general AI capabilities: any agent aspiring to be called a general AI should be able to complete all of its tasks. It is therefore a natural testbed for how far a purely connectionist approach can go in the pursuit of general artificial intelligence.

The tasks are expressed in natural language and fall into twenty categories. The first concerns single supporting facts, e.g. "Mary went to the bathroom. John moved to the hallway. Mary travelled to the office. Where is Mary?" Subsequent tasks add further supporting facts describing more actions by the same person. Another task asks for relations to be understood and resolved, e.g. "The kitchen is north of the bathroom. What is north of the bathroom?" A more intricate variant is Task 19 (path finding), which asks how to get from one location to another, e.g. "The kitchen is north of the bathroom. How do you go from the kitchen to the bathroom?" Here the answer is a sequence of directions, which makes the task considerably harder than simply resolving a relation.

Other tasks include questions with binary answers; a counting task in which a single agent picks up and drops items and the network must say how many items the agent currently holds; tasks involving negation, conjunction, and three-valued answers ("yes", "no", "maybe"); and coreference resolution. Further tasks test time reasoning, positional reasoning, and size reasoning, similar in spirit to Winograd sentences. The remaining tasks cover basic syllogistic deduction, induction, and resolving the agent's motivation.

The authors of the dataset evaluated a number of methods, but the results of unmodified memory networks are particularly interesting, since they show how far a purely connectionist approach can go. We give the accuracies for plain memory networks here and refer the reader to the original paper for the remaining results.

Winograd sentences are sentences designed to test a computer's ability to resolve pronoun coreference. They were proposed as an alternative to the Turing test, since they avoid two of its major flaws: the Turing test encourages deceptive behaviour, and its results are hard to quantify at scale. An example of a Winograd sentence is "I tried to put the book in the drawer but it was too [big/small]." They are named after Terry Winograd, who first discussed such sentences in the 1970s.

The results show that memory networks are effective at coreference resolution and pure deduction. Problems arise in inference-heavy tasks, notably path finding and size reasoning, where deduction is needed to obtain the result. This is not surprising, since memory networks have no dedicated reasoning component. Interestingly, a modified memory network achieved 100% accuracy on induction but only 73% on deduction, which underlines how important it is to improve the reasoning capabilities of neural networks if they are to surpass the benchmarks set by plain memory networks.

1. J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79(8), 2554–2558 (1982)

2. D.H. Ackley, G.E. Hinton, T. Sejnowski, A learning algorithm for Boltzmann machines. Cogn. Sci. 9(1), 147–169 (1985)

3. P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ed. by D.E. Rumelhart, J.L. McClelland, the PDP Research Group (MIT Press, Cambridge, 1986)

4. G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)

5. Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in Proceedings of the 19th International Conference on Neural Information Processing Systems (MIT Press, Cambridge, 2006), pp. 153–160

6. Y. Bengio, Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009)

7. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)

8. W. Bechtel, A. Abrahamsen, Connectionism and the Mind: Parallel Processing, Dynamics and Evolution in Networks (Blackwell, Oxford, 2002)

9. A. Graves, G. Wayne, I. Danihelka, Neural Turing machines (2014), arXiv:1410.5401

10. J. Weston, S. Chopra, A. Bordes, Memory networks, in ICLR (2015), arXiv:1410.3916

11. S. Sukhbaatar, A. Szlam, J. Weston, End-to-end memory networks (2015), arXiv:1503.08895

12. J. Weston, A. Bordes, S. Chopra, A.M. Rush, B. van Merriënboer, A. Joulin, T. Mikolov, Towards AI-complete question answering: a set of prerequisite toy tasks, in ICLR (2016), arXiv:1502.05698

13. T. Winograd, Understanding Natural Language (Academic Press, New York, 1972)

An Incomplete Overview of Open Research Questions

We conclude this book with a list of open research questions. A similar list, from which we have borrowed some of the problems presented here, can be found in [1].

We have tried to compile a varied list, to show how rich and diverse research in deep learning can be. The problems we find most intriguing are:

1 Can we find something other than gradient descent as a basis for backpropagation? Can we find an alternative to backpropagation as a whole for weight updates?

2 Can we find new and better activation functions?

3 Can reasoning be learned, and if so, how? If not, how can we approximate symbolic processes within connectionist architectures? Incorporating planning, spatial reasoning and knowledge into artificial neural networks seems essential. Symbolic computation can be represented by numerical expressions, which can then be optimized: for truth values in {0, 1}, for instance, the logical expression \( A \rightarrow B \) can be written as \( 1 - A(1 - B) \), and conjunction as plain multiplication. Since numerical representations of the logical connectives are easy to find, the question is whether a neural network can discover and exploit such representations on its own.
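As a quick sanity check on the claim that connectives have numerical representations, the following sketch verifies two standard encodings over all Boolean truth assignments: multiplication for conjunction and \( 1 - A(1 - B) \) for implication. The code is purely illustrative:

```python
from itertools import product

# Numerical encodings of connectives for truth values in {0, 1}:
conj = lambda a, b: a * b            # A and B
impl = lambda a, b: 1 - a * (1 - b)  # A implies B

# Check against the classical truth tables.
for a, b in product([0, 1], repeat=2):
    assert conj(a, b) == int(a == 1 and b == 1)
    assert impl(a, b) == int((not a) or b)
```

Both encodings use only addition and multiplication, the same primitives a neural network composes, which is what makes the question above plausible in the first place.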

4 Deep learning, understood as the use of many layers of nonlinear operations, can be seen as analogous to the reuse of many subformulas in symbolic systems. Can this analogy be made formal and explored further?

5 Why are convolutional neural networks easier to train than other neural networks with the same number of parameters?
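One commonly cited ingredient is weight sharing: a convolutional layer reuses the same small kernel at every spatial position, so its parameter count is tiny compared to a fully connected layer over the same input. A back-of-the-envelope comparison (the layer sizes here are illustrative, not taken from the text):

```python
# Parameter counts for a 28x28 single-channel input (MNIST-sized):
H, W = 28, 28

# Fully connected layer mapping the flattened image to 100 hidden units:
dense_params = (H * W) * 100 + 100   # weights + biases

# Convolutional layer with 32 filters of size 5x5, shared across positions:
conv_params = 32 * (5 * 5 * 1) + 32  # kernels + biases

print(dense_params, conv_params)     # 78500 832
```

The convolutional layer has roughly two orders of magnitude fewer parameters while still covering the whole image, which constrains the hypothesis space; whether that fully explains the easier training is exactly the open question.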

6 Can we develop an effective strategy for self-taught learning, i.e. for exploiting unlabelled samples, and even for actively seeking out training data with an autonomous agent?

S Skansi, Introduction to Deep Learning, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-319-73004-2_11

7 The approximation of gradients in neural networks is reasonably accurate, but it is computationally less efficient than symbolic derivation. For humans it is often easier to estimate a value close to a target, such as a minimum, than to calculate the precise number. Can we find better algorithms for calculating approximate gradients?
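For intuition, the simplest numerical alternative to symbolic derivation is the central-difference approximation. It is accurate but needs two function evaluations per parameter, which is precisely why it does not scale to millions of weights; a minimal sketch with an arbitrary example function:

```python
def approx_grad(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: d/dx of x^2 at x = 3 is exactly 6.
g = approx_grad(lambda x: x ** 2, 3.0)
```

The open question is whether cheaper schemes exist that trade a little accuracy for far less computation than either this or full backpropagation.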

8 Can we develop a strategy that prepares an agent for an unknown future task, so that it can anticipate the task and start learning immediately, while retaining the knowledge gained from previous tasks as it adapts to new ones?

9 Can we prove theoretical results for deep learning which use more than just formalized simple networks with linear activations (threshold gates)?

10 Is there a depth of deep neural networks sufficient to reproduce all human behaviour? Could we rank human actions by the number of hidden layers needed to reproduce them? And how does such a ranking relate to the Moravec paradox, the observation that tasks easy for humans are often the hardest for machines?

11 Do we have a better alternative than simply randomly initializing weights? Since in neural networks everything is in the weights, this is a fundamental problem.
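One well-known refinement of plain random initialization is to scale the random draws by the layer's fan-in and fan-out, as in Glorot/Xavier initialization, so that activation variances stay roughly constant across layers. A minimal sketch, with arbitrary layer sizes:

```python
import numpy as np

def glorot_uniform(n_in, n_out, seed=0):
    """Xavier/Glorot uniform initialization: draw weights from
    U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = glorot_uniform(784, 100)   # e.g. a 784 -> 100 layer
```

Such schemes are still random; the open question is whether something fundamentally better than (scaled) randomness exists.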

12 Are local minima an unavoidable aspect of deep architectures, or a limitation of current models? Hand-crafted features can improve performance, but deep neural networks can also extract features on their own, and often get trapped in local minima while doing so. Curriculum learning has proven helpful in some settings; is it necessary for certain tasks?

13 Are models that are hard to interpret probabilistically (such as stacked autoencoders, transfer learning, multi-task learning) interpretable in other formalisms? Perhaps fuzzy logic?

14 Can deep networks be adapted to learn from trees and graphs, not just vectors?

15 The human cortex is not purely feed-forward but massively recurrent, and most cognitive tasks presumably involve this recurrence. Can some cognitive tasks be learned by feed-forward networks alone, or do they require recurrent networks?

The Spirit of Connectionism and Philosophical Ties

Connectionism, under the name of deep learning, is thriving and challenging the traditional dominance of GOFAI in artificial intelligence. Reasoning remains the one major cognitive ability this approach has yet to conquer, and whether it will is uncertain. But artificial neural networks, which once faced near extinction, are now a vital part of AI and cognitive science, and they have acquired a captivating allure, owing in no small part to good marketing.

Just as a sculptor needs both a clear vision and the skills and tools to realize it, a scientist needs both ideas and technique. Philosophy and mathematics, the two oldest disciplines, provide exactly this: when ideas run dry, philosophy offers inspiration, and mathematics supplies the tools. Engaging with both can only help a career in science, including the study of neural networks.

As this book draws to a close, I hope you have found the journey rewarding. Remember that this is only the start of your deep learning adventure. Keep pursuing knowledge and questioning the status quo, and ignore anyone who tells you that your efforts are misguided or irrelevant. As the proverb says: "Every day, write something new."

If you have no new ideas, revisit old ones; and if that fails, read. Sooner or later a visionary emerges, overcoming great obstacles and resistance. Neural networks themselves embody this struggle, a journey from despair to triumph. The life of Walter Pitts, a philosophical logician who sought knowledge in solitude and hoped to change the world through logic, is a poignant reminder of perseverance, and an inspiration for anyone striving to leave a mark on history.

1. Y. Bengio, Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009)

1 Books, journal articles, Arxiv, Coursera, Udacity, Udemy, etc—there is a vast universe of resources out there.

2 I do not know whose proverb it is, but I do know it was someone’s, and I would be very grateful if a reader who knows the author contacts me.
