# SpringMeeting2009Abstracts

### Devdatt Dubhashi

*Can a Computational Linguist Create a Genome?*

### Graham Kemp

*Can a Computational Linguist Fold a Protein?*

The protein folding problem is one of the most important and enduring problems in structural bioinformatics. The challenge is: given the sequence of amino acid residues in a protein chain, predict the 3D conformation that the chain will adopt. To enable experimentation with, and evaluation of, different algorithmic approaches, simplified toy problems have been proposed that share some of the characteristics of the protein folding problem, but which are far more tractable to computational solutions. One popular simplification is to use a lattice model, where the positions at which amino acid residues are placed are restricted to lie on a regular grid. Recently, linguistics-based dynamic programming methods have been applied to the problem of protein folding on a lattice. This talk will introduce the problem of protein folding on a lattice, describe the application of linguistics-based dynamic programming methods, and outline possible future directions. Some other applications of techniques from linguistics to problems in structural bioinformatics will also be mentioned.

### Jean Krivine

*Bigraphical Molecular Systems*

Bio molecular systems such as signaling cascades and gene regulation networks are highly distributed processes that may determine life and death of cells. Although each cascade may be studied independently and each gene may be associated with its own activation and inhibition domains, the behavior of a cell is only dictated by the integration of several and possibly contradictory signals. In order to disentangle the mechanisms underlying cell "decisions", Regev and Shapiro proposed to transfer theories used for the study and design of concurrent languages to the systems biologist's workbench. Along that line of research, Danos and Laneve have proposed to dissociate the problem of describing bio molecular systems (syntax) with the one of representing their dynamics (operational semantics). Drifting slightly away from process algebra based models, they proposed to use a form of concurrent graph rewriting language, called the kappa-calculus, in which nodes represent molecular entities (usually proteins) and edges complex formation.

After some general motivations, we shall see in this talk that kappa-calculus terms can be easily interpreted as a simple case of Milner's bigraphs. This will take us from a graph theory based language to a richer algebraic environment called Bigraphical Molecular Systems (BMS). In this formalism we will see that it becomes possible to describe fundamental biological mechanisms of action such as transport and compartmentalization which were not originally expressible in the kappa-calculus fragment.

### Femke van Geel and Aarne Ranta

*The GF Resource Grammar Summer School*

The GF Resource Grammar Library is an open-source computational grammar resource that currently covers 12 languages. The Summer School is a part of a collaborative effort to extend the library with new languages. 20 new languages have so far been proposed by the participants of the mailing list for the Summer School, which will be held in Gothenburg in August. The talk will give some highlights of what the library is useful for and what it involves to contribute to it.

### Ana Bove

*Another Look at Function Domains*

Bove and Capretta have presented a method to deal with partial and general recursive functions in constructive type theory which relies on an inductive characterisation of the domains of the functions. The method separates the logical and the computational aspects of an algorithm, and facilitates the formal verification of the functions being defined. For nested recursive functions, the method uses Dybjer' schema for simultaneous inductive-recursive definitions. However, not all constructive type theories support this kind of definitions.

Here we present a new approach for dealing with partial and general recursive functions that preserves the advantages of the method by Bove and Capretta, but which does not rely on inductive-recursive definitions. In this approach, we start by inductively defining the graph of the function, from which we \emph{first} define the domain and afterwards the type-theoretic version of the function. We show two ways of proving the formal specification of the functions defined with this new approach: by induction on the graph, or by using an induction principle in the style of the induction principle associated to the domain predicates of the Bove-Capretta method.

### Harald Hammarström

*Morphology Induction as a String Compression Problem*

### Torbjörn Karfunkel

*Multi-dimensional Analysis of Noisy Data*

### Peter Damaschke

*Probabilistic particle tracking - plans within the e-science initiative*

Particle tracking means to reconstruct the trajectories of particles and their collisions with atoms, from observed signals in a detector. Collisions generate showers of secondary particles that may again collide with atoms, etc. The problem appears, e.g., if one wants to recognize several variants of radioactive decay where different numbers of neutrons are emitted. Since only charged particles produce light signals, the number of incoming neutrons cannot be inferred with certainty. Given a picture from the detector, can we say how many neutrons entered the detector, and assign probabilities to these cases? Only this number is of interest, while trajectory reconstruction is only a means for probability calculations:

The scenario may be modeled as a Bayesian inference problem. A hypothesis h is a set of energy vectors of incoming neutrons. The data D is a set of collision points in space. (The time of collisions may be known as well.) Not all points may be visible, because some neutrons knock out only neutrons. Measurements of the energy of secondary particles and of the energy deposit in the detector layers is very inaccurate. It is appropriate to apply a coarse dicretization of these values. Probabibilty distributions of showers are derived from physical modelling and statistical material: For a neutron starting with a certain energy, distributions are assumed for the path length until the collision, the number and type of secondary particles, and their energies and angles. The whole process can be seen as an edge-labeled directed "forest", where each collision is a "star" with one in-edge and some out-edges. The out-degree is small and can also be zero. Due to the independence of collisions, the (conditional!) probability of a forest is simply the product of the probabilities of stars. Finally, P(h|D) is the sum of the probabilities of all forests having the same observed vertex set D. In order to find the hypotheses h with the highest likelihoods P(h|D) one has to infer the most likely forests. Instead of products of probabilities one can work with sums of negative logarithms called scores. This turns the problem into roughly the following: Given a set of stars with scores, connect them to spanning directed forests with minimum total scores; but compute enough alternative forests with a total probability close to 1.

Because of the "domino" condition (the labels of neighbored stars must be compatible), the problem is easily shown to be NP-complete. The algorithms currently used are unsatisfactory: Greedy algorithms tend to underestimate the number of neutrons, exhaustive search is time-consuming, and heuristics like simulated annealing do not guarantee the most likely solutions. An exact search algorithm combining branch-and-bound with a least-cost-first rule looks promising. We would generate partial solutions starting at the assumed primary collision points, and extend the solution with the least lower bound for the total score in every step. Efficiency relies on good lower bounds. One idea is to assign to every edge e the least score of a star with in-edge e, and to compute a minimum spanning tree of the yet uncovered points, ignoring for the moment the compatibility of stars. The intuition is that the cheapest solutions will typically contain many "locally cheapest" edges being involved in stars with reasonable angles and energy values.

### Dag Wedelin

*Identification of dynamical systems, benchmarks and e-Science*