\documentclass[compsoc,12pt,a4paper]{IEEEtran}
\usepackage{blindtext}
\usepackage{color}
\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows}
\usepackage{wrapfig}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{
  frame=none,
  aboveskip=3mm,
  belowskip=3mm,
  showstringspaces=false,
  columns=flexible,
  basicstyle={\small\ttfamily},
  numbers=none,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
  breaklines=true,
  breakatwhitespace=true,
  tabsize=3,
  escapeinside={(**}{*)}
}
\usepackage{cite}
\graphicspath{{./img/}}
\DeclareGraphicsExtensions{.PNG}
\usepackage[cmex10]{amsmath}
\interdisplaylinepenalty=2500
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amssymb}
% Define block styles
\tikzstyle{decision} = [diamond, draw, fill=blue!20, text width=4.5em, text badly centered, node distance=3cm, inner sep=0pt]
\tikzstyle{block} = [rectangle, draw, fill=blue!20, text width=5em, text centered, rounded corners, minimum height=4em]
\tikzstyle{line} = [draw, -latex']
\tikzstyle{cloud} = [draw, ellipse, fill=red!20, node distance=3cm, minimum height=2em]
\hyphenation{op-tical net-works semi-conduc-tor}
\begin{document}
\title{\vspace{8.0cm}---EmerGen(e)tic---\\Exploring the use\\of genetic algorithms\\in emergent distributed\\systems}
\author{\vspace{2.0cm}Ben Goldsworthy, 33576556\\Computer Science BSc}
\maketitle
\onecolumn
\section*{Declaration}
I certify that the material contained in this dissertation is my own work and does not contain unreferenced or unacknowledged material. I also warrant that the above statement applies to the implementation of the project and all associated documentation.
Regarding the electronically submitted version of this work, I consent to this being stored electronically and copied for assessment purposes, including the School's use of plagiarism detection systems in order to check the integrity of assessed work.\\ I agree to my dissertation being placed in the public domain, with my name explicitly included as the author of the work.\\ All working documents and results can be found at \lstinline|www.lancaster.ac.uk/ug/goldswor/scc300/|.\\\\ Date:\\ Signed:
\newpage
\begin{abstract}
Adaptive and emergent systems attempt to address both the deficiencies inherent in distributed systems and the necessarily finite ability of any programmer to predict every eventuality in which their software may one day find itself. This paper argues that these systems fail to go far enough, and proposes a further development---genetic systems---which utilise genetic programming to extend the versatility of a given system massively, if not infinitely. This paper then details the EmerGen(e)tic framework for rapidly testing genetic algorithm modules within emergent systems, as well as an example module pertaining to the cache updating behaviour of a web server. It concludes by proposing further avenues of potentially fruitful research based upon these programs and their findings.
\end{abstract}
\newpage
\section{Introduction}
\label{sec:introduction}
Coulouris \textit{et al.} (2012) define a \textit{distributed system} as `[...]one in which components located at networked computers communicate and coordinate their actions only by passing messages.'\cite{distsys} By the very nature of distributed computing, any system will---as a whole---experience a range of fluctuating conditions and environments that may affect its performance.
Although a well-designed, reliable distributed system will be developed to be `as fault tolerant as possible'\cite{reliable} and with as many of these potential operating environments as possible in mind, the totality of all possible conditions for all possible futures in which the system remains in use is impossible for the original developer(s) to have entirely predicted and accounted for. This inhibits the universality of current distributed systems, as well as incurring future maintenance and development costs when they need to be adapted for future circumstances.\\ \textit{Adaptive systems} are a subset of distributed systems in which the system is given a number of submodules that define different behaviour, and the system is programmed with policies for determining which submodule to use for a given task. An example may be a file server that activates a compression module when serving files over a given size. Whilst this adaptive behaviour does increase the system's operational range, it still requires manual programming of policies and thus suffers from the same issues as general distributed systems.\\ \textit{Emergent systems} are a proposed solution to these limitations. Emergent systems are a subset of adaptive systems in which the system itself determines the best configuration to deal with a given situation by trying multiple configurations and assessing their performance against a given metric.
This, coupled with some form of storage for recording previously-discovered optimum solutions that can be consulted heuristically for improved performance, provides a more adaptive final system whilst simultaneously reducing development overheads.\\ Filho \& Porter (2016) best outline the problem in response to which emergent systems have been developed, stating that (emphasis theirs) `[human-centric] approaches [to self-organising software] rely on prediction of how a system will respond to environments (which may turn out to be false) and result in \textit{inflexibility} [whilst a] \textit{machine-centric} approach [...] moves the burden of complexity into software itself, avoids the need for prediction of behaviours, and supports total flexibility [...]'.\cite{machinecentric}\\ However, these emergent systems are not without their limitations. Whilst they possess far more flexibility with regards to arranging themselves into new and unpredicted configurations than does a simpler adaptive system, they remain constrained by the finite pool of components they have access to, which are still developed by human hand.\\
\begin{algorithm}
\caption{Basic genetic algorithm}
\label{alg1}
\begin{algorithmic}
\REQUIRE initial population
\REQUIRE num. of generations
\REQUIRE num.
of candidates per generation
\FORALL{generations}
\FORALL{chromosomes in generation}
\STATE{run test on chromosome}
\STATE{record fitness score of chromosome}
\ENDFOR
\FOR{strongest chromosome(s) in generation}
\STATE{copy over to next generation}
\ENDFOR
\FOR{remaining chromosomes in generation}
\IF{small probability = true}
\STATE{apply mutation/crossover operation}
\ENDIF
\STATE{copy over to next generation}
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
\textit{Genetic algorithms} rely on metaphors from evolutionary biology such as random mutation and natural selection in order to develop---over the course of multiple iterations and utilising a finite set of operations---improved software, with each generation ideally moving closer towards an optimal solution as specified via the result of a given fitness function $f()$. They have been used to produce everything from spaceship antennae\cite{antenna} to software bug checkers.\cite{genprog} In a previous work,\cite{ants} this author applied genetic algorithms to the development of simulated ants. Algorithm \ref{alg1} shows the procedure by which genetic programming is performed.\\ On the one hand, we have systems that need to be freed from human development as much as possible in order to be generalisable to the widest range of possible operating environments. On the other, we have a programming approach in which the human developer sets up a number of variables before reclining whilst the computer handles the rest of the time-consuming programming work---the two fields would appear to be ready allies.
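The loop of Algorithm \ref{alg1} can be sketched in Python as follows. This is an illustrative sketch only, not project code: the chromosome representation (a list of integers), the fitness metric and the \lstinline|mutate()| operator are placeholder assumptions, with lower fitness scores treated as better (cf. response times).

```python
import random

def mutate(chromosome):
    # Placeholder operand mutation: perturb one random gene by +/-1.
    i = random.randrange(len(chromosome))
    mutated = list(chromosome)
    mutated[i] += random.choice([-1, 1])
    return mutated

def evolve(population, generations, elite_frac=0.1, mutation_rate=0.05):
    # Illustrative fitness metric: lower is better.
    def fitness(chromosome):
        return sum(chromosome)

    for _ in range(generations):
        # Test every chromosome in the current generation.
        ranked = sorted(population, key=fitness)
        # Elite selection: the strongest pass through unchanged.
        elite_count = max(1, int(len(ranked) * elite_frac))
        next_generation = ranked[:elite_count]
        # The rest mutate with a small probability before copying over.
        for chromosome in ranked[elite_count:]:
            if random.random() < mutation_rate:
                chromosome = mutate(chromosome)
            next_generation.append(chromosome)
        population = next_generation
    return population
```

Because the elite are copied across unchanged, the best fitness score in the population can never worsen between generations, even though the mutated remainder may regress.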
If genetic programming could be effectively applied to the generation of components for an emergent software system, it would allow the creation of an infinite pool of potential components and theoretically allow the system to surmount any given situation, given enough time.\\ Thus, the research questions that this project aims to answer are: \begin{itemize} \item can genetic algorithms be productively used to produce optimum components for an emergent system? and \item assuming so, what are the optimal conditions for doing so (i.e. mutation probabilities, generation sizes, etc.)? \end{itemize} From these research questions come the aims for this project; they are: \begin{itemize} \item to produce a framework for enabling rapid development of genetic improvement tests, \item to test a small, simple program in the hope of demonstrating improvement through application of these techniques and \item to repeat the tests multiple times with a variety of conditions and see what, if any, trends emerge. \end{itemize} This report is divided into eight sections: \begin{enumerate} \item `Introduction', in which the broad goal of the project in question is outlined, \item `Background', in which the history of the field in question is outlined in more detail, \item `Design', in which the design of the present solution is outlined in a non-technical manner, from first principles, \item two `Implementation' sections, in which the technical implementation of both the EmerGen(e)tic genetic testing framework and the specific \lstinline|cachingpolicy| module is detailed, \item `The System in Operation', in which usage---and normal output---of the systems above are described, \item `Testing \& Evaluation', in which the results of testing on the systems above are investigated and the project as a whole evaluated, with its successes and failures candidly discussed and \item `Conclusion', in which the research questions above are revisited and avenues for future research signposted.
\end{enumerate}
\newpage
\section{Background}
\label{sec:background}
\subsection{Assumptions}
This project operates on a foundation of two assumptions, neither of which---though they may both appear to be---are axiomatic.\\ \noindent \textbf{Assumption \#1: Programming is hard}\\ \noindent IBM found that the cost of fixing a bug can range from a base cost (for one discovered during the requirements elicitation phase of software design) to 100$\times$ that (for one identified in the maintenance phase).\cite{bugcost} With the lengths of written software steadily increasing---1993's Windows NT 3.1 comprised 4.5m lines of code, against 2003's Windows Server 2003's 50m---and an average of `about 15--50 errors per 1000 lines of delivered code',\cite{1000} these problems are being exacerbated.\footnote{For this author's further exploration of the resultant security risks of this, amongst other factors, see \cite{cni}.} As such, we desire systems that will perform as many programming responsibilities on our behalf as possible, as well as as much of the checking, testing and validation work as we can get away with---repetitive, methodical tasks such as testing are ideally suited to a computer, but not to a human. We also desire systems that can be constructed out of smaller, simpler parts that can be more rigorously tested prior to deployment.\footnote{cf. the `Unix philosophy', best detailed in \cite{unix}} It is these goals that have led to the development of distributed, adaptive and emergent systems, as well as genetic programming.\\ \noindent \textbf{Assumption \#2: Circumstances change}\\ \noindent It would be hubris \textit{par excellence} to assume that humanity has learnt all that there is to learn and can be surprised no longer.
Bronze Age man discovered ironworking, assumptions of classical physics were rent asunder by the discovery of quantum physics and the discovery of non-Euclidean geometry is said to have `marked the end of an entire line of human thought, one that had dominated intellectual efforts in the West for centuries.'\cite{geometry} No one programmer---nor a team of programmers---can hope to have perfect, 20/20 foresight encompassing all of the environmental possibilities in which their system may find itself deployed in the future. Even if they could, theoretically, handle all possible situations as are known to them now, there is every possibility of a non-Euclidean geometry-esque upending of the conventional wisdom occurring at an unspecified point in the future, entirely incomprehensible from within the bounds of their Euclidean worldview---they would be akin to the denizens of Abbott's \textit{Flatland}, trying to conceive of the third dimension.
\subsection{Distributed and Adaptive Systems}
\textit{Distributed systems}---amusingly described by Lamport (1987) as `[a system] in which the failure of a computer you didn't even know existed can render your own computer unusable'\cite{email}---have their origins in the early days of computing and the concept of multiprogramming. Carr, Crocker \& Cerf (1970) outlined ARPANET, `one of the most ambitious computer networks attempted to date'.\cite{rfc} As one of the first networks, ARPANET serves as an example of one of the first instances of distributed computing and a predecessor to the modern-day Internet. Coulouris \textit{et al.}'s definition of a distributed system as `[...]one in which components located at networked computers communicate and coordinate their actions only by passing messages'\cite{distsys} may, however, begin to make the drawbacks of such a system apparent.
As these computers are in different geographic locations, their needs and experiences may differ---a program written in California may work fine on the fast, reliable internet connections that are all its developer may have ever experienced, but less so for the instance of it running in rural India.\\ One solution to these issues is \textit{adaptive systems}, in which a distributed system is given a number of submodules containing different approaches to its tasks. The developer will then program in various policies for which modules to use in which conditions. Examples of this could include a live online video player that resorts to a different quality of transmission when being broadcast to a client with a suboptimal network connection. Again, the Internet is an excellent example of an adaptive system: with various protocols on offer, a developer can choose the one that best suits a given need. If error detection and correction is important, TCP is the tool for the job; if it is not, then UDP provides a faster transmission with less overhead.\\ However, these adaptive systems fail to resolve the initial issue of the developer having to predict the myriad situations their program may one day find itself running in. By still requiring the developer's input in specifying the various module combination strategies and policies, all these systems achieve is to allow the developer to write more generalised, less specific (sub)programs. This may be beneficial for rapid reuse elsewhere, but is otherwise not the solution we seek.
\subsection{Emergent Systems}
One such proposed solution, however, is the \textit{emergent system}: an adaptive system in which the program itself (or a framework on top of it) designs its own policies by trial-and-error, recording the performance of different combinations of submodules. The name comes from the possibility of such a system discovering optimal behaviour unpredicted by its human creators.\\ Dana\footnote{Dana documentation and downloads can be found at \lstinline|www.projectdana.com|} is a component-based language designed specifically for emergent system programming by Porter (2014), described by the author as `an imperative, procedural, interpreted language, [that] is multi-threaded, and features only interface, record and primitive types'\cite{dana}---it is also `syntactically similar to contemporary languages like Java'\cite{dana}.\\ Filho \& Porter (2016a) describe a web server written in Dana.\footnote{For an explanation of the function of Filho \& Porter's web server, as well as concepts such as `caching' and `compression', see \S \ref{sec:design}} The web server receives requests for files---a mixture of file types and sizes---from clients and serves them up. The server can construct behaviour out of a number of components that enable different policies for cache updating, file compression, etc.
The authors test the server in every possible configuration and against different styles of workload, measuring the request response time per configuration.\\ The authors showed that different configurations gave markedly different results---for example, for a `Workload 1 [consisting] of one client repeatedly requesting only one text-only HTML file, [a particular architecture] performs best because, in this configuration, the web servers always compress the requested files, and once the file is returned to the load balancer, it is stored in a small content cache at the load balancer.'\cite{machinecentric} Meanwhile, a cacheless architecture performed better at a workload consisting of `one client requesting a different text-only HTML file for every request'\cite{machinecentric} as caching would provide no benefit for a constantly-changing series of requests and would only add pointless overhead.\\ In a further paper, Filho \& Porter (2016b), the authors go on to elucidate the limitations of a simple adaptive system, arguing that `[the] requirement [for human specification of control strategies] is fundamentally opposed to the core ideas behind autonomic computing, which are borne of the increasing difficulty for humans to understand modern software systems in dynamic environments.'\cite{losingcontrol} Developing their web server platform further, they implement a perception and learning system so that the web server can analyse the performance of varied configurations and determine an optimum arrangement for a given task.
Finally, in Filho \textit{et al.} (2016), this is consolidated into the single framework RE$^{\text{X}}$, which `produce[s] systems that are responsive to the actual conditions that they encounter at runtime, and the way they perceive their behaviour in these conditions.'\cite{rex}\\ Other frameworks within the same field of emergent systems exist, such as Elkhodary, Esfahani \& Malek (2010)'s FUSION,\cite{fusion} but I assert the following with this paper: that the very idea of emergent systems as a whole fails to go far enough towards solving the issues it was formulated in response to.
\subsection{`Genetic Systems'}
Floyd (1979) said that the following words were written on the wall of a Stanford University graduate student office: `I would rather write programs to help me write programs than write programs.'\cite{floyd} This is a succinct description of the goals of genetic programming, in which representations (usually tree representations) of computer programs are modified using a set of mutation and crossover operators and a set of varying probabilities for each to occur. These changes take place across multiple generations and should, hopefully, trend towards better-performing programs. The applications are vast: Le Goues \textit{et al.} (2012) applied them to automated software bug fixers;\cite{genprog} Hansen \textit{et al.} (2007) to counter-cyberterrorism;\cite{security} Hornby \textit{et al.} (2015) to the shape of spacecraft antennae;\cite{antenna} and the author of this paper to the production of simulated ants.\cite{ants}\\ Indeed, the wall of Stanford's student office neatly encapsulates the thinking behind this paper's proposal as to the viability and desirability of applying genetic programming techniques within the emergent system space---what this author proposes to call `genetic systems'.
If an emergent system can already assemble configurations of provided submodules, test them and rank them, and the goal is to eliminate the developer from the software development equation as much as possible, then what is next? Why not have the system assemble the submodules themselves? This would increase the pool of potential submodules available to the program to be theoretically infinite, and the same benefits of this emergent behaviour would apply here too. It is this hypothesis that is to be tested within this paper.
\newpage
\section{Design}
\label{sec:design}
Within this section the design of the EmerGen(e)tic (from `emergent' and `genetic') system is outlined. Following that, the web server and caching system used within the \lstinline|cachingpolicy| module are outlined, from first principles, for the non-technical reader's understanding. For a technical overview, see \S \ref{sec:cachingpolicyimpl}.
\subsection{EmerGen(e)tic}
The primary goal of this project was to produce a framework for quickly and easily allowing future projects to test the effects of genetic algorithms on varying elements of varying systems. To do this, a master Dana program was created which runs through each generation, calling a separate file (hereafter referred to as a \textit{module}) to handle all of the evolutionary logic---the mutations, crossovers, selections, etc. This is so that developers can easily plug `n' play with their own genetic modification modules in the future. The master program will then run through all the candidates from each generation to test their responses to differing stimuli. The testing functionality is contained within a single method, and so should be the only part of the EmerGen(e)tic code that a future developer needs to modify to implement their own project.\\ To assist with generalisability, as many settings as possible (e.g.
mutation probability, file(s) to test with, number of generations) were left to be externally set via either command-line arguments or config. files.\\ The below flowchart shows the intended operation of the system:\\
\begin{center}
\begin{tikzpicture}[node distance = 4.5cm, auto]
% Place nodes
\node [block] (init) {EmerGen(e)tic};
\node [block, left of=init] (results) {Results};
\node [block, below of=results] (script) {Mutation module};
\node [block, below of=script] (generation) {Generation $x$};
% Draw edges
\path [line] (init) -- (script) node[midway,left,align=center] {1. For each gen.\\call module};
\path [line] (script) -- (generation) node[midway,left,align=center] {2. Create next\\gen.'s pop.};
\path [line] (init) |- (generation) node[midway,right,align=center] {3. Test generation};
\path [line] (init) -- (results) node[midway,above,align=center] {4. Output\\results};
\end{tikzpicture}
\end{center}
\subsection{The Web Server}
\subsubsection{How a web server works}
A \textit{web server} is a computer that receives requests from other computers (called `clients') for the files located within it, via a number of protocols such as HTTP, usually across an Internet connection. Once it has received a request, which will contain the name of the file requested, it searches its file storage for a match. If it finds one, it sends the file's data back to the client. If it does not, it should return a message that the client knows to interpret as an error message. Below is a flowchart of how a web server works:\\
\begin{center}
\begin{tikzpicture}[node distance = 6cm, auto]
% Place nodes
\node [block] (client) {Client};
\node [block, below of=client] (webserver) {Web Server};
\node [block, right of=webserver] (file) {File Store};
% Draw edges
\path [line] (client) -- (webserver) node[midway,left,align=center] {1. Client requests\\file};
\path [line] (webserver) -- (file) node[midway,above,align=center] {2.
Server retrieves\\file from\\file store};
\path [line] (file) -- (webserver);
\path [line] (webserver) -- (client) node[midway,right,align=center] {3. Server returns\\file\\(or error)};
\end{tikzpicture}
\end{center}
\subsubsection{Added extras}
This is the most basic form of web server---for every request, it trawls through its files for a match and then returns that to the client. However, there are numerous extra functions that can be added to improve performance, and which can be combined to best suit various situations. For example, files can be compressed through various schemes in order to reduce their filesize before sending, and then decompressed at the client's end, which shortens the amount of time taken to transmit the (now-smaller) file.\\ However, the one most relevant to this project is \textit{caching}. In caching, a cache is kept---this is a section of storage that is faster to retrieve data from than the regular file store, but which lacks the capacity to hold the full file store. When a file is returned for a client's request, it is also added to the cache. This means that if a client then requests the same file again, it will be found within the cache and returned faster than if the entire file store had to be searched.\\ Obviously, the cache will at some point fill up. At this point, various policies for updating a full cache can be implemented. The simplest is to start again at the beginning and overwrite the first item. More complex ones can bring in variables like which file was requested least recently, or has been requested the fewest times, in order to try to improve speed. Again, different policies suit different situations, and there is no `magic bullet'.
\subsubsection{A tale of two web servers}
The present project is based on Filho \& Porter's Dana web server code. Their project includes a client program, which is given a list of files grouped by various criteria (e.g.
all large files, all image files, mixtures of both, etc.). It then sends the server program HTTP requests for each file in the list until it reaches the end. In this particular version of the software Filho \& Porter's PAL system is implemented, meaning every possible configuration found is tried---each file is tested with every combination of the provided caching and compression methods, and the response time of the server (in ms) recorded. This response time is the metric by which the PAL system ranks various configurations.\\ The goal of Filho \& Porter's project was to demonstrate improvements in file handling when using different combinations of policies for different types of request patterns. However, for the present project, the system is too complex to control all of the independent variables required for the tests to be run. As such, the scope of inquiry had to be limited to caching policy only, and much of Filho \textit{et al.}'s server project code was removed. This left only the caching functionality and some remnants of the web server code.
\subsection{Caching}
\label{sec:caching}
Cache behaviour is located within the \lstinline|/cache/CacheHandler*.dn| family of components. Filho \& Porter's server code contains six \lstinline|CacheHandler*| variants, each with different policies for determining which item in a full cache---represented by a fixed-size array---to replace with a newly-requested file, as well as a simple cache system in which the cache is only one item long. The other methods are: \begin{itemize} \item serial replacement, where each item in turn is replaced, returning to the first item upon reaching the end of the array, \item most- and least-frequently used, \item most- and least-recently used.
\end{itemize} For example, \lstinline|CacheHandlerMFU.dn| contains the policy for replacing the most-frequently used cache item.\\ Within each of these variant cache handling policy files there is necessarily a lot of duplicated code. Common to all variants are the method signatures (but not bodies) for: \begin{itemize} \item \lstinline|updateCache()|, which updates a full cache using the variant's specified policy, \item \lstinline|clearCache()|, which clears the cache completely, and \item \lstinline|getCachedResponse()|, which returns either the item that was retrieved from the cache or \lstinline|null| to the caller. \end{itemize} Cache initialisation is handled as a conditional within \lstinline|updateCache()| (i.e. if there is no cache present, it creates one and places the current item in cache index 0).\\ There are also a series of functions that are only present in individual \lstinline|CacheHandler| files, such as \lstinline|mostRecentlyUsed()| in \lstinline|CacheHandlerMRU.dn|. This does what it says on the tin, returning the most recently-used item from the cache to be overwritten. Another example is \lstinline|random()| (within \lstinline|CacheHandlerR.dn|), which unsurprisingly returns a random item from the cache.\\ For the genetic algorithm to work it needs a base file. The base file requires all of the common functionality to be present and marked as out-of-bounds for the genetic algorithm. Included in these out-of-bounds methods would be the full complement of methods such as \lstinline|mostRecentlyUsed()|, which could be called upon or not as decided by a given output program.
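As a minimal illustration of this family of components, the shared method signatures and a pluggable replacement policy might be sketched as follows. This is a Python analogue for exposition only: the real components are written in Dana, and the class, method and policy names below merely mirror those described above (a dictionary stands in for the fixed-size array).

```python
class CacheHandler:
    """Fixed-size cache with a pluggable replacement policy (illustrative)."""

    def __init__(self, size, pick_victim):
        self.size = size
        self.cache = {}                 # key -> cached response
        self.uses = {}                  # key -> hit count
        self.pick_victim = pick_victim  # policy: handler -> key to evict

    def get_cached_response(self, key):
        # Return the cached item, or None to the caller on a miss.
        if key in self.cache:
            self.uses[key] += 1
            return self.cache[key]
        return None

    def update_cache(self, key, value):
        # A full cache is updated using the variant's specified policy.
        if key not in self.cache and len(self.cache) >= self.size:
            victim = self.pick_victim(self)
            del self.cache[victim]
            del self.uses[victim]
        self.cache[key] = value
        self.uses.setdefault(key, 0)

    def clear_cache(self):
        # Clear the cache completely.
        self.cache.clear()
        self.uses.clear()

def most_frequently_used(handler):
    # Analogue of the policy in CacheHandlerMFU.dn: evict the item
    # with the highest hit count.
    return max(handler.uses, key=handler.uses.get)
```

Swapping \lstinline|most_frequently_used| for another function (least-frequently used, most- or least-recently used, serial, random) changes only the eviction choice, which is precisely the duplication-versus-variation structure of the \lstinline|CacheHandler*.dn| family.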
\subsection{The `Cache Policy' Genetic Algorithm}
The basic workings of a genetic algorithm were covered in \S \ref{sec:introduction}, but here the specifics of the algorithm as applied to this project shall be outlined.\\ As this project is concerned only with modifying cache updating behaviour, and thus in how to produce an index value given differing formulae, the most important line is line 110 of \lstinline|CacheHandlerBase.dn|:
\begin{lstlisting}[language=Java]
index = 0
\end{lstlisting}
The formula on the right-hand side of that assignment is the scope for mutation. From that initial `\lstinline|0|' must sprout new formulae---these formulae are what is referred to in the remainder of this report as \textit{chromosomes}. A single \textit{generation} will consist of multiple \lstinline|CacheHandler*.dn| files (or \textit{candidates}), each of which will contain a single chromosome. Within each generation, therefore, shall exist multiple distinct chromosomes. Each of these sets of chromosomes is known as the generation's \textit{population}.\\ As mentioned before, the basic tools of a genetic algorithm are the operations of \textit{mutation} and \textit{crossover}, which shall be discussed below (along with a heuristic known as `elite selection'):\\ \noindent\textbf{Mutation}\\ \noindent In mutation, a section of chromosome $A$ is replaced with a new value. For example, from the example chromosome `\lstinline|0|', an operand mutation operation could change it to `\lstinline|4|'. Within the scope of this project, four types of mutation were implemented: \begin{itemize} \item binary operator mutation (e.g. `\lstinline|2+4|' into `\lstinline|2*4|'), \item operand mutation, \item unary operator mutation (e.g. `\lstinline|nthMostRecentlyUsed(2)|' into `\lstinline|nthMostFrequentlyUsed(2)|'), and \item subtree creation (e.g. `\lstinline|2+2|' into `\lstinline|2+(4*2)|').
\end{itemize} The first three simply randomly change elements of the chromosomes in the hope of finding particularly efficacious variants. The fourth, however, serves the vital role of allowing the chromosomes---which all start as `\lstinline|0|'---to increase in complexity.\\ \noindent\textbf{Crossover}\\ \noindent In crossover, two chromosomes $A$ and $B$ are selected. A random section of $A$ ($A'$) is then taken and inserted at a random point of $B$ ($B'$) to produce a new chromosome $C$. For example, take \begin{gather*} A = \text{ \lstinline|2+(4+3)|}\\ B = \text{ \lstinline|4*3|} \end{gather*} and an $A'$ of `\lstinline|+(4+3)|' and a $B'$ between the `\lstinline|4|' and the `\lstinline|*|'---we produce \begin{gather*} C = \text{ \lstinline|4+(4+3)*3|} \end{gather*} Variants of crossover also exist, such as multi-point crossover (where multiple $A'$s are copied over to multiple $B'$s); however, due to time constraints, only single-point crossover was implemented for this project.\\ \noindent\textbf{Elite Selection}\\ \noindent There are a number of time-saving tactics (or \textit{heuristics}) that can be used to speed up a genetic algorithm. One such heuristic, implemented within this project, is \textit{elite selection}. This takes an arbitrary percentage of the top-performing chromosomes from each generation (in this instance, the top 10 \%) and copies them across to the next generation unchanged. This ensures that optimal solutions are not simply discarded when the next generation is produced. \newpage \section{Implementation: EmerGen(e)tic} \label{sec:implementation} This section contains a more technically-detailed overview of the implementation of the EmerGen(e)tic program.
\subsection{Folder Structure} The base folder structure for EmerGen(e)tic is as follows: \begin{itemize} \item \lstinline|/archives/|, where archives of test run data go after the completion of a script's run, \item \lstinline|/project/|, which contains the mutation project modules, \item \lstinline|/resources/|, which contains Dana overhead, \item \lstinline|/results/|, which contains per-generation and per-script results files, and \item \lstinline|/scripts/|, which contains \lstinline|.script| files for running through multiple tests, grouped by some criterion/criteria. \end{itemize} \subsection{\lstinline|emergenetic.dn|} This Dana file contains the framework for running and testing genetic algorithm projects.\footnote{See Appendix \ref{appendix:emergen}} Passed to it are a number of command-line arguments, including the module to load and the number of generations to test. \subsubsection{\lstinline|App:main()|} The program is entered through the \lstinline|App:main()| method, whereupon the passed arguments are validated---if any are invalid, detailed error messages are output and the program terminates. The method runs the setup file \lstinline|setup.sh| from within the passed module directory (and passes the result to a variable of type \lstinline|RunStatus|, which can be helpful for debugging), and then iterates through each generation. For each generation, it runs the given module's \lstinline|mutator.py| program, followed by the \lstinline|runGeneration()| method.\\ Upon successfully finishing the iteration of each generation, the program exits with a code of `0'---in the event of invalid arguments, it exits with a code of `1'. \subsubsection{\lstinline|runGeneration()|} The \lstinline|runGeneration()| method calls the \lstinline|runCandidate()| method for each candidate within the given generation. Once this is done, it appends a newline to the \lstinline|results.csv| file within the \lstinline|results| directory.
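The control flow described above can be summarised with the following hedged sketch (the real implementation is in Dana; the callable parameters here are purely illustrative stand-ins for invoking \lstinline|setup.sh|, \lstinline|mutator.py| and \lstinline|runGeneration()|):

\begin{lstlisting}[language=Python]
# Hedged sketch of App:main()'s control flow; not the actual Dana code.
def main_loop(generations, run_setup, run_mutator, run_generation):
    """Return the exit code: 0 on success, 1 on invalid arguments."""
    if generations < 1:
        return 1                  # invalid arguments: exit code 1
    run_setup()                   # setup.sh, run once before testing
    for gen in range(generations):
        run_mutator(gen)          # breed this generation's candidates
        run_generation(gen)       # test each candidate; log results
    return 0                      # all generations finished: exit code 0
\end{lstlisting}

The key point is the strict per-generation ordering: the mutator must have produced a generation's candidates before that generation is tested.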
\subsubsection{\lstinline|runCandidate()|} This is the point at which a future developer will implement their own code for running and testing the candidates of their own project. In the base release of \lstinline|emergenetic.dn|, it is left blank. \subsubsection{\lstinline|printResults()|} This method prints the results of each test run to output. It also appends the results of the run to the per-generation results file \lstinline|results|$x$\lstinline|.txt| and to the per-script results file \lstinline|results.csv| within the \lstinline|results| directory. \newpage \section{Implementation: \lstinline|cachingpolicy|} \label{sec:cachingpolicyimpl} In this section, the technical implementation of the \lstinline|cachingpolicy| module used for this project is detailed. \subsection{Folder Structure} In addition to the base folder structure of EmerGen(e)tic, the following directories were added: \begin{itemize} \item \lstinline|/cache/|, where the generations of various \lstinline|CacheHandler*.dn| files are kept, each within a directory numbered with its generation number and in files of the format \lstinline|CacheHandler|$x$\lstinline|_|$y$\lstinline|.dn|, where $x$ is the number of the generation and $y$ is the number of the candidate, \item \lstinline|/cachebackup/|, which contains a fresh copy of \lstinline|CacheHandlerBase.dn| for quickly clearing the \lstinline|/cache/| directory after a full test run, \item \lstinline|/htdocs/|, which contains all the files to request from the web server, \item \lstinline|/project/cachingpolicy|, which contains the files for the \lstinline|cachingpolicy| module and \item \lstinline|/resources/cache|, which contains the Dana specifications for the caching components.
\end{itemize} \subsection{Genetic Algorithm search space} Before a genetic algorithm can be implemented, the search space must be defined: \begin{gather*} \text{Let }c\text{ represent the cache length} \\ \text{Let }l\text{ represent an AST leaf value} \\ \text{Let }n\text{ represent an AST node value} \\\\ l \in \mathbb{N} \\ n \in \left\{\times, +, -, \div, MostFrequent\left(\right), MostRecent\left(\right), Rand\left(\right) \right\} \\\\ \text{Let }a\left(n\right)\text{ be a function that returns the arity of function }n \\ \text{Let }r\text{ represent the result of }n\left(\right)\text{, }n\left(l\right)\text{ or }n\left(l_1, \dots, l_{a\left(n\right)}\right) \\ \text{Let }i\text{ represent the cache index value to replace} \\\\ r\in \mathbb{N} \\ i \in \left\{x \in \mathbb{N} : 0 \leq x \leq c-1\right\} \\ \therefore i = r \bmod{c} \end{gather*}\\ In real terms, then, the set of actual Dana operators and methods $n_{Dana}$ used was \[ n_{Dana} \in \left\{\text{\lstinline|*|}, \text{\lstinline|+|}, \text{\lstinline|-|}, \text{\lstinline|/|}, \text{\lstinline|nthMostFrequentlyUsed()|}, \text{\lstinline|nthMostRecentlyUsed()|}, \text{\lstinline|random()|}\right\} \] and the resolve flag $res$, which was passed through to \lstinline|nthMostFrequentlyUsed()| and \lstinline|nthMostRecentlyUsed()| as a constant in each individual program, was of the value \[res \in \left\{\text{\lstinline|n|}, \text{\lstinline|o|}, \text{\lstinline|r|}\right\}\] \subsection{\lstinline|CacheHandlerBase.dn|} From the various \lstinline|CacheHandler*.dn| files within Filho \& Porter's \lstinline|/cache/| were distilled the common elements of each (the methods covered in \S \ref{sec:caching}), collected into one file.\footnote{See Appendix \ref{appendix:base}} \subsubsection{\lstinline|CacheHandler:getCachedResponse()|} This method returns an item specified in a request from the cache if it exists there---otherwise, it returns \lstinline|null|.
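As a concrete illustration of the search space defined above, the following sketch (in Python; the real arithmetic is generated into Dana code) evaluates a chromosome held as a nested token list and maps its result $r$ onto a valid cache index $i$. The left-to-right evaluation and the restriction to the four binary operators are simplifying assumptions of the sketch:

\begin{lstlisting}[language=Python]
import operator

# Illustrative only: evaluates left-to-right, ignoring precedence.
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.floordiv}

def evaluate(chrom):
    """Evaluate a chromosome such as ['2', '+', ['4', '*', '2']]."""
    def value(tok):
        return evaluate(tok) if isinstance(tok, list) else int(tok)
    result = value(chrom[0])
    for i in range(1, len(chrom) - 1, 2):
        result = OPS[chrom[i]](result, value(chrom[i + 1]))
    return result

def cache_index(chrom, cache_length):
    """Map the chromosome's result r into the range 0..cache_length-1."""
    return evaluate(chrom) % cache_length
\end{lstlisting}

The modulo at the end guarantees that whatever formula evolution produces, the resulting index always falls within the cache's bounds.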
\subsubsection{\lstinline|CacheHandler:updateCache()|} This method updates the cache if a requested item is not already in it. If the cache is full, this method contains the formula for determining the cache index to replace. Lines 70--71---demarcated with the comments \lstinline[language=Java]|// BEGIN| and \lstinline[language=Java]|// END|---indicate to the genetic algorithm its bounds of operation. Within these lines, the formula for determining cache index $i$ is defined for each chromosome. \subsubsection{\lstinline|CacheHandler:clearCache()|} This method clears the cache completely. \subsubsection{\lstinline|nthMostFrequentlyUsed()|, \lstinline|nthMostRecentlyUsed()| \& \lstinline|random()|} These methods return the item in the cache that fits their criterion (or, in the case of \lstinline|random()|, a random item). Including all of the various most/least frequently/recently used methods would lead to an unnecessarily large file, so they were generalised into the methods \lstinline|nthMostFrequentlyUsed()| and \lstinline|nthMostRecentlyUsed()|, which each take their $n$ as an argument. \subsubsection{\lstinline|resolve()|} The \lstinline|resolve()| method takes a flag indicating how to resolve the situation of multiple returns from one of the aforementioned three methods (e.g. in the event that the least-frequently-used file is not the only file in the cache with that number of hits). The flag can be set to return the newest item, the oldest item or a random choice. \subsection{\lstinline|emergenetic.dn|} Within the \lstinline|runCandidate()| method,\footnote{See Appendix \ref{appendix:runcand}} the \lstinline|CacheHandler| component within \lstinline|CacheHandler*.dn| is loaded using Dana's \lstinline|RecursiveLoader|, which ensures that all of its dependencies are also loaded. A timer is started and each file within the given script file is requested from the web server.
When the script file is finished, the timer is stopped and the overall time taken (in ms) is passed to \lstinline|printResults()|. As \lstinline|RecursiveLoader| lacks an \lstinline|unload()| method, each of the loaded components is then looped through and \lstinline|unload()| called on them in turn. \subsection{\lstinline|cachingpolicy/mutator.py|} \label{sec:python} This Python file contains the genetic algorithm logic for the \lstinline|cachingpolicy| module.\footnote{See Appendix \ref{appendix:genetic}} It can be run as a standalone Python script or imported as a Python module into other projects. \subsubsection{\lstinline|getSubLists()|} This method recursively produces a \lstinline|List| of nested \lstinline|List|s representing every subtree present within a given chromosome (i.e. every bracketed expression). \subsubsection{\lstinline|crossover()|} This method performs single-point crossover between two chromosomes passed to it as arguments $A$ and $B$. It first calls \lstinline|getSubLists()| on $A$ in order to get a \lstinline|List| of all the possible subtrees of $A$. It then randomly chooses points within $B$ until it finds an operand token, at which point it replaces the operand with a randomly-chosen \lstinline|List| from $A$. \subsubsection{\lstinline|mutate()|} This method performs four distinct multiple-point mutations. On its first run it imports the individual mutation probabilities from \lstinline|config.conf|, and when generating the initial population it overrides the mutation probability to be 100 \%. It then recursively iterates over each token in the chromosome and may, with some probability, apply one of the mutation operations to it. The recursion comes in when it encounters a token of type \lstinline|List| (i.e. a bracketed expression), in which case it calls \lstinline|mutate()| upon the sublist.
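Under the nested-list chromosome representation, the three methods just described might look something like the following sketch (illustrative Python only, not the actual \lstinline|mutator.py|; for brevity, only operand mutation and binary-operator mutation are shown, and the helper names are anglicised stand-ins):

\begin{lstlisting}[language=Python]
import random

def get_sub_lists(chrom):
    """Every subtree (bracketed expression) of a chromosome, itself included."""
    subs = [chrom]
    for tok in chrom:
        if isinstance(tok, list):
            subs.extend(get_sub_lists(tok))
    return subs

def crossover(a, b, rng=random):
    """Single-point crossover: a random subtree of A replaces an operand of B."""
    donor = rng.choice(get_sub_lists(a))
    child = list(b)
    while True:
        i = rng.randrange(len(child))
        if not (isinstance(child[i], str) and child[i] in '+-*/'):
            child[i] = donor           # found an operand token; graft here
            return child

def mutate(chrom, p, rng=random):
    """Recursively walk the chromosome, mutating each token with probability p."""
    for i, tok in enumerate(chrom):
        if isinstance(tok, list):
            mutate(tok, p, rng)                    # recurse into subtree
        elif tok in '+-*/':
            if rng.random() < p:
                chrom[i] = rng.choice('+-*/')      # binary-operator mutation
        elif rng.random() < p:
            chrom[i] = str(rng.randint(0, 9))      # operand mutation
    return chrom
\end{lstlisting}

Passing $p = 1.0$ to \lstinline|mutate()| corresponds to the 100 \% mutation probability used when seeding the initial population.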
\subsubsection{\lstinline|parse()| \& \lstinline|compile()|} The former of these methods takes an expression as a string (e.g. `\lstinline|2+(4*2)|') and returns a \lstinline|List| representation (with nested \lstinline|List|s for bracketed expressions). The latter takes such a \lstinline|List| and returns the original string expression. \subsubsection{\lstinline|createInitialPop()|} If the Python script is called with a generation argument of `0' it calls the \lstinline[language=python]|createInitialPop()| method first. This creates an initial population by copying over \lstinline|CacheHandlerBase.dn| and applying \lstinline[language=python]|mutate()| to each candidate with a 100 \% mutation probability. \subsubsection{\lstinline|readChromosomeFromFile()| \& \lstinline|writeChromosomeToFile()|} These methods are simple utility methods for reading and writing chromosomes from and to \lstinline|CacheHandler*.dn| files. \subsubsection{\lstinline|hasSubTrees()|} This method returns a boolean indicating whether or not a passed chromosome has nested expressions. \subsubsection{\lstinline|main()|} The \lstinline|main()| method takes the $n$ candidates of the previous generation $m$ (passed to it as arguments) and selects the top 10 \% of them to go across unchanged to generation $m+1$. To each of the remainder, probabilities of mutation and crossover are applied and, when those operations occur, separate probabilities determine the type of mutation or crossover that will take place. \subsection{\lstinline|cachingpolicy/config.conf|} \lstinline|mutator.py| reads in its probability values from the file \lstinline|config.conf|.\footnote{For an example \lstinline|config.conf|, see Appendix \ref{appendix:settings}} This file contains a number of name-value pairs. The left-hand side of each pair is ignored by \lstinline|mutator.py| and is thus present only as an aid to the human user.
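As a final illustration of the chromosome representation used throughout \lstinline|mutator.py|, the \lstinline|parse()|/\lstinline|compile()| round trip described earlier can be sketched as follows (illustrative Python; for brevity it handles only numeric operands, the four binary operators and brackets, and the second function is named \lstinline|compile_| to avoid shadowing Python's built-in \lstinline|compile|):

\begin{lstlisting}[language=Python]
import re

def parse(expr):
    """e.g. '2+(4*2)' -> ['2', '+', ['4', '*', '2']]"""
    tokens = re.findall(r'\d+|[+\-*/()]', expr)
    def build(pos):
        out = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == '(':
                sub, pos = build(pos + 1)  # recurse into the bracketed subtree
                out.append(sub)
            elif tok == ')':
                return out, pos            # close the current subtree
            else:
                out.append(tok)
            pos += 1
        return out, pos
    return build(0)[0]

def compile_(chrom):
    """The inverse: nested lists back to a bracketed string expression."""
    return ''.join('(' + compile_(tok) + ')' if isinstance(tok, list) else tok
                   for tok in chrom)
\end{lstlisting}

Round-tripping an expression through \lstinline|parse()| and then \lstinline|compile_()| should reproduce the original string, which is what allows chromosomes to be written back into \lstinline|CacheHandler*.dn| files after mutation.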
\newpage \section{The System in Operation} In this section is outlined the process of creating a new module---and running a suite of tests using it---via the EmerGen(e)tic framework. \subsection{Creating a Module} \subsubsection{Files} To create a new module for a future research project, create the folder within the \lstinline|project| directory. Within this directory, create a Python file \lstinline|mutator.py|, which will contain all of your genetic algorithm logic, and a Bash script file \lstinline|setup.sh| to perform any functions prior to each test, such as creating or deleting folders (this may perform no tasks, but the file must still be present).\footnote{For an example \lstinline|setup.sh|, see Appendix \ref{appendix:setup}} \subsubsection{Implementation} To implement the new module and test its results, code must be written within the \lstinline|runCandidate()| method of the Dana file \lstinline|emergenetic.dn|. Within this method will be defined the process of testing, how to use script files and the metric by which success is measured. The other methods within the file will remain unchanged. \subsection{Creating Scripts} In order to run a suite of different tests for each candidate, scripts are required. In the \lstinline|cachingpolicy| instance, these script files contain the filenames of different files grouped by some property or properties. For example, \lstinline|difffile-html.script| tests a number of different files, all of which are HTML but are a mixture of sizes, whilst \lstinline|difffile.script| tests different files that are a mixture of HTML and images and have a range of sizes. This can be used to easily showcase the system producing different optimal behaviours under different operational conditions. \subsection{Running EmerGen(e)tic} After compiling the \lstinline|emergenetic.dn| file, the program can be run. It expects four command-line arguments, and can take a fifth optional one.
The command is as follows: \begin{lstlisting}[language=bash] dana emergenetic