Biological informatics. What bioinformatics can do

If you ask a casual passer-by what biology is, they will probably answer something like "the science of living things". About computer science they will say that it deals with computers and information. If we are not afraid to be intrusive and ask a third question - what is bioinformatics? - this is where they will probably get confused. That is understandable: not everyone knows about this area of knowledge even at EPAM, although our company does bioinformatics too. Let's figure out why humanity in general and EPAM in particular need this science: after all, someone might suddenly ask us about it on the street.

Why biology can no longer cope without informatics, and what cancer has to do with it

To conduct research, it is no longer enough for biologists to run tests and look through a microscope. Modern biology deals with colossal amounts of data. Often it is simply impossible to process them manually, so many biological problems are solved by computational methods. We need not look far for an example: the DNA molecule is so small that it cannot be seen under a light microscope. And even when it can be seen (with an electron microscope), visual study does not help solve many problems.

Human DNA consists of three billion nucleotides - a whole lifetime would not be enough to analyze them all manually and find the right site. Well, perhaps it would be enough - one lifetime to analyze one molecule - but that is too long, expensive and unproductive, so the genome is analyzed using computers and calculations.

Bioinformatics is the whole set of computer methods for analyzing biological data: sequenced DNA and protein structures, micrographs, signals, databases of experimental results, etc.

Sometimes DNA sequencing is needed to find the right treatment. The same disease, caused by different hereditary disorders or environmental exposure, needs to be treated in different ways. There are also regions in the genome that are not associated with the development of the disease, but, for example, are responsible for the response to certain types of therapy and drugs. Therefore, different people with the same disease may respond differently to the same treatment.

Bioinformatics is also needed to develop new drugs. Their molecules must have a specific structure and bind to a specific protein or piece of DNA. Computational methods help to model the structure of such a molecule.

Achievements of bioinformatics are widely used in medicine, primarily in cancer therapy. DNA also encodes information about susceptibility to other diseases, but most of the work is devoted to cancer treatment. This direction is considered the most promising, financially attractive and important - and the most difficult.

Bioinformatics at EPAM

At EPAM, bioinformatics is handled by the Life Sciences division. It develops software for pharmaceutical companies and for biological and biotechnology laboratories of all sizes - from startups to the world's leading companies. Only people who understand biology, can design algorithms and can program are able to cope with such a task.

Bioinformaticians are hybrid specialists. It is difficult to say which knowledge is primary for them: biology or computer science. If the question is put this way, they need to know both. Most important, perhaps, are an analytical mindset and a willingness to learn a lot. At EPAM there are biologists who went on to study computer science, and programmers and mathematicians who additionally studied biology.

How people become bioinformaticians

Maria Zueva, developer:

“I got a standard IT education, then I took courses in the EPAM Java Lab, where I became interested in machine learning and Data Science. When I graduated from the lab, I was told: "Go to Life Sciences, they do bioinformatics and they are hiring." I won't lie: that was the first time I heard the word "bioinformatics". I read about it on Wikipedia and went.

Then a whole group of newcomers was recruited into the division, and we studied bioinformatics together. We started by revising the school curriculum on DNA and RNA, then we examined in detail the problems that exist in bioinformatics, the approaches to solving them and the algorithms, and learned to work with specialized software.”

“I am a biophysicist by training; in 2012 I defended my PhD thesis in genetics. For some time I worked in science doing research, and I continue to do so to this day. When the opportunity arose to apply scientific knowledge in production, I seized it immediately.

As a business analyst, I have a very specific job. For example, financial issues pass me by; I'm more of a subject-matter expert. I have to understand what the customers want from us, work out the problem and draw up high-level documentation - a specification for the programmers - and sometimes make a working prototype of the program. As the project progresses, I keep in touch with developers and customers so that both are confident that the team is doing what is required of it. In effect, I am a translator from the language of the customers - biologists and bioinformaticians - into the language of developers and back.”

How the genome is read

To understand the essence of EPAM's bioinformatics projects, you first need to understand how the genome is sequenced, because the projects we will talk about are directly related to reading the genome. Let's turn to the bioinformaticians for an explanation.

Mikhail Alperovich, head of the bioinformatics unit:

“Imagine that you have ten thousand copies of War and Peace. You put them through a shredder, mixed them well, pulled a bunch of paper strips at random out of the heap, and are now trying to reassemble the original text from them. In addition, you have the manuscript of War and Peace. The text you assemble will need to be compared with it in order to catch typos (and there will certainly be some). Modern sequencing machines read DNA in much the same way. DNA is isolated from cell nuclei and split into fragments of 300-500 base pairs (recall that nucleotides in DNA are linked to each other in pairs). The molecules are fragmented because no modern machine can read the genome from start to finish: the sequence is too long, and errors accumulate as it is read.

Recall War and Peace after the shredder. To restore the original text of the novel, we need to read all its pieces and arrange them in the correct order. In effect, we read the book several times over in tiny fragments. It is the same with DNA: the sequencer reads each part of the sequence many times over, with overlaps - after all, we are analyzing not one but many DNA molecules.

The fragments obtained are then aligned: each of them is “laid against” the reference genome to work out which part of the reference it corresponds to. Then variations are found in the aligned fragments - meaningful differences between the reads and the reference genome (typos in the book compared with the reference manuscript). This is done by programs called variant callers (mutation detectors). This is the most difficult part of the analysis, so there are many different variant callers; they are constantly being improved, and new ones are being developed.

The overwhelming majority of the mutations found are neutral and do not affect anything. But some of them encode a predisposition to hereditary diseases or the ability to respond to different types of therapy.”
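The alignment and comparison steps described in the quote can be sketched in a few lines of Python. This is a deliberately naive illustration, not how production aligners or variant callers work: it tries every ungapped placement of a read on the reference and keeps the one with the fewest mismatches, whereas real tools use indexes, gapped alignment and statistical models. The function names and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def align_read(reference: str, read: str) -> Tuple[int, int]:
    """Return (position, mismatches) of the best ungapped placement of the read.

    Returns (-1, len(read) + 1) if the read is longer than the reference.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def differences(reference: str, read: str, pos: int) -> List[Tuple[int, str, str]]:
    """List (reference position, reference base, read base) for every mismatch."""
    return [
        (pos + i, reference[pos + i], base)
        for i, base in enumerate(read)
        if reference[pos + i] != base
    ]


# Toy data: the read matches the reference at offset 5 except for one base.
reference = "ACGTTAGCCGATACGT"
read = "AGCCGTTA"
position, mismatches = align_read(reference, read)
print(position, mismatches, differences(reference, read, position))
```

A single mismatch like this could be either a real variation or a reading error, which is why each region is read many times.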

For analysis, a sample is taken that contains many cells - and therefore many copies of the complete set of cellular DNA. Each small piece of DNA is read several times to minimize the chance of error. If even one significant mutation is missed, the patient may be misdiagnosed or treated inappropriately. Reading each piece of DNA once is not enough: a single reading may be wrong, and we would not know about it. If we read the same piece twice and get one correct and one incorrect result, it will be difficult to tell which reading is true. But if we have a hundred readings and 95 of them show the same result, we can conclude that it is correct.
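The reasoning about one, two and a hundred readings can be written as a simple majority vote over the bases observed at one position. The sketch below is only an illustration: the thresholds `min_depth` and `min_fraction` are hypothetical values chosen for the example, and real variant callers rely on base qualities and statistical models rather than a fixed cut-off.

```python
from collections import Counter
from typing import Optional, Sequence


def call_base(
    observed: Sequence[str],
    min_depth: int = 10,        # hypothetical: minimum number of readings
    min_fraction: float = 0.9,  # hypothetical: required share of agreeing readings
) -> Optional[str]:
    """Return the majority base at a position, or None if the evidence is too weak."""
    if not observed or len(observed) < min_depth:
        return None  # too few readings to trust any of them
    base, count = Counter(observed).most_common(1)[0]
    if count / len(observed) >= min_fraction:
        return base
    return None  # readings disagree too much


print(call_base(["A"]))                     # one reading: no call
print(call_base(["A", "G"]))                # two conflicting readings: no call
print(call_base(["A"] * 95 + ["G"] * 5))    # 95 of 100 agree: the base is called
```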

Gennady Zakharov:

“To analyze cancers, you need to sequence both healthy and diseased cells. Cancer results from mutations that a cell accumulates over the course of its life. If the mechanisms responsible for growth and division break down in a cell, it begins to divide indefinitely, regardless of the needs of the body - that is, it gives rise to a cancerous tumor. To understand what exactly caused the cancer, samples of healthy tissue and of the tumor are taken from the patient. Both samples are sequenced, the results are compared, and the difference between them shows which molecular mechanism has broken down in the cancer cell. Based on this, a drug is selected that is effective against cells with that “breakage”.”
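At its simplest, the comparison of the two samples is a set difference: variations seen in the tumor but not in the healthy tissue are candidates for the mutations acquired by the cancer cells. The sketch below assumes that variations have already been called for both samples; the tuple layout and the coordinates are hypothetical, and real tumor/normal pipelines also account for sequencing depth, tumor purity and errors.

```python
from typing import Set, Tuple

# Hypothetical representation: (chromosome, position, reference base, observed base)
Variant = Tuple[str, int, str, str]


def somatic_candidates(tumor: Set[Variant], normal: Set[Variant]) -> Set[Variant]:
    """Variations present in the tumor sample but absent from the healthy sample."""
    return tumor - normal


# Made-up example data for illustration only.
normal_sample = {("chr1", 100, "A", "T")}
tumor_sample = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}
print(somatic_candidates(tumor_sample, normal_sample))
```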

Bioinformatics: production and open source

The bioinformatics division at EPAM has both production and open source projects. Moreover, a part of a production project can grow into an open source project, and an open source project can become a part of production (for example, when an open source EPAM product needs to be integrated into the client's infrastructure).

Project #1: variant caller

For one of its clients, a large pharmaceutical company, EPAM modernized a variant caller. Its distinguishing feature is that it can find mutations that other similar programs miss. The program was originally written in Perl and had complex logic. At EPAM, the program was rewritten in Java and optimized - it now runs 20, if not 30, times faster.

The source code of the program is available on GitHub.

Project #2: 3D molecule viewer

There are many desktop and web applications for visualizing molecular structure in 3D. Imagining how a molecule looks in space is extremely important, for example, in drug development. Suppose we need to synthesize a drug that has a targeted effect. First, we need to design the molecule for this drug and make sure that it interacts with the right proteins exactly as it should. In life, molecules are three-dimensional, so they are also analyzed in the form of three-dimensional structures.

For 3D viewing of molecules, EPAM built an online tool that initially worked only in a browser window. Then, based on this tool, a version was developed that visualizes molecules in an HTC Vive virtual reality headset. The headset comes with controllers that let you rotate a molecule, move it, bring it up to another molecule, and rotate its individual parts. Doing all this in 3D is much more convenient than on a flat monitor. This part of the EPAM bioinformatics project was done jointly with the Virtual Reality, Augmented Reality and Game Experience Delivery division.

The program is still being prepared for publication on GitHub, but a demo version is available in the meantime.

You can learn how to work with the application from the video.

Project #3: NGB Genomic Browser

The genomic browser visualizes individual DNA reads, variations, and other information generated by genome analysis utilities. Once the reads have been compared with the reference genome and the mutations have been found, it remains for the scientist to check whether the machines and algorithms worked correctly. The diagnosis the patient receives and the treatment prescribed depend on how accurately the mutations in the genome are identified. Therefore, in clinical diagnostics a scientist should check the work of the machines, and a genomic browser helps with this.

For bioinformatics developers, the genomic browser helps analyze complex cases to find errors in algorithms and understand how they can be improved.

The new genomic browser NGB (New Genome Browser) from EPAM works on the web, but in terms of speed and functionality it is not inferior to the desktop counterparts. This is a product that was lacking in the market: previous online tools were slower and could do less than desktop ones. Nowadays, many customers choose web applications for security reasons. The online tool allows you not to install anything on the scientist's work computer. You can work with it from anywhere in the world by going to the corporate portal. A scientist does not have to carry a work computer with him everywhere and download all the necessary data to it, of which there can be a lot.

Gennady Zakharov, business analyst:

“I worked on the open source utilities partly as a customer: I set the tasks. I studied the best solutions on the market, analyzed their advantages and disadvantages, and looked for ways to improve them. We needed to make web solutions that were no worse than their desktop counterparts and at the same time add something unique to them.

In the 3D molecule viewer, that was working with virtual reality, and in the genomic browser, it was improved handling of variations. Mutations can be complex. Rearrangements in cancer cells sometimes affect huge regions. Extra chromosomes appear in them; chunks of chromosomes and whole chromosomes disappear or merge in random order. Individual pieces of the genome can be copied 10-20 times. Such data is, first, more difficult to obtain from the reads and, second, more difficult to visualize.

We developed a visualizer that correctly reads information about such large-scale rearrangements. We also made a set of visualizations that show, when chromosomes come into contact, whether hybrid proteins were formed as a result. If a large variation affects several proteins, we can, on click, calculate and show what happens as a result of that variation and which fusion proteins are produced. In other visualizers, scientists had to track this information manually; in NGB it takes one click.”

How to study bioinformatics

We have already said that bioinformaticians are hybrid specialists who must know both biology and computer science. Self-education plays an important role in this. Of course, EPAM has an introductory course in bioinformatics, but it is designed for employees who will use this knowledge on a project. Classes are held only in St. Petersburg. And yet, if bioinformatics is of interest to you, there is an opportunity to learn:

Bioinformatics has become an important part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing make it possible to extract useful results from large amounts of raw data. In genetics and genomics, bioinformatics assists in the sequencing and annotation of genomes and observed mutations. It plays a role in the analysis of data from the biological literature and the development of biological and genetic ontologies for organizing and querying biological data. It plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools help in comparing genetic and genomic data and, in general, in understanding the evolutionary aspects of molecular biology. In general terms, it helps to analyze and catalog biological pathways and networks, which are an important part of systems biology. In structural biology, it assists in the simulation and modeling of DNA, RNA and protein structures, as well as molecular interactions.

History

Recognizing the important role of the transmission, storage and processing of information in biological systems, in 1970 Paulien Hogeweg introduced the term "bioinformatics", defining it as the study of information processes in biotic systems. This definition places bioinformatics in parallel with biophysics (the study of physical processes in biological systems) and biochemistry (the study of chemical processes in biological systems).

At the beginning of the "genomic revolution" the term "bioinformatics" was rediscovered and came to mean the creation and maintenance of databases for storing biological information.

Sequences. Computers became essential in molecular biology when protein sequences became available after Frederick Sanger sequenced insulin in the early 1950s. Comparing multiple sequences by hand proved impractical. The pioneer in this field was Margaret Oakley Dayhoff. David Lipman (director of the National Center for Biotechnology Information) called her "the mother and father of bioinformatics." Dayhoff compiled one of the first protein sequence databases, originally published as books, and pioneered methods for sequence alignment and molecular evolution.

Genomes. As complete genome sequences became available, again thanks to the pioneering work of Frederick Sanger, the term bioinformatics was rediscovered to denote the creation and maintenance of databases for storing biological information such as nucleotide sequences (the GenBank database in 1982). Creating such databases involved not only design issues but also building an integrated interface that would allow researchers to query existing data and add new data. Once the data became publicly available, tools for processing it were quickly developed and described in journals such as Nucleic Acids Research, which published special issues on bioinformatics tools as early as 1982.

Objectives

The main purpose of bioinformatics is to promote understanding of biological processes. Bioinformatics differs from other approaches in that it focuses on creating and applying computationally intensive methods to achieve this goal. Examples of such techniques include pattern recognition, data mining, machine learning algorithms, and biological data visualization. The main efforts of researchers are focused on sequence alignment, gene finding (searching for DNA regions that encode genes), genome decoding, drug design, drug development, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and modeling of evolution.

Bioinformatics today implies the creation and improvement of databases, algorithms, computational and statistical methods and theory to solve practical and theoretical problems arising from the management and analysis of biological data.

Main research areas

Analysis of genetic sequences

Biodiversity assessment

Basic bioinformatics programs

ACT (Artemis Comparison Tool) - genomic analysis
Arlequin - analysis of population genetic data
Bioconductor is a large-scale FLOSS project providing many individual packages for bioinformatics research. Written in.
BioEdit
BioNumerics is a commercial universal software package
BLAST - search for related sequences in the database of nucleotide and amino acid sequences
Clustal - multiple nucleotide and amino acid sequence alignment
DnaSP - DNA sequence polymorphism analysis
FigTree - phylogenetic tree editor
Genepop
Genetix - population genetic analysis (the program is available only in French)
JalView - editor for multiple alignment of nucleotide and amino acid sequences
MacClade is a commercial software for interactive evolutionary data analysis
MEGA - Molecular Evolutionary Genetic Analysis
Mesquite is a program for comparative biology in Java
MUSCLE - multiple alignment of nucleotide and amino acid sequences. Faster and more accurate than ClustalW
PAUP - phylogenetic analysis using the parsimony method (and other methods)
PHYLIP - phylogenetic software package
Phylo_win - phylogenetic analysis. The program has a graphical interface.
PopGene - analysis of genetic diversity of populations
Populations - population genetic analysis
PSI Protein Classifier - generalization of the results obtained using the PSI-BLAST program
Seaview - Phylogenetic Analysis (GUI)
Sequin - sequence deposition at GenBank, EMBL, DDBJ
SPAdes - bacterial genome assembler
SplitsTree - a program for building phylogenetic trees
T-Coffee - Multiple progressive alignment of nucleotide and amino acid sequences. More sensitive than ClustalW / ClustalX.
UGENE - a free tool with a Russian-language interface for multiple alignment of nucleotide and amino acid sequences, phylogenetic analysis, annotation and work with databases.

[Video] [Slides]

Many years ago, the revolution in nuclear physics led to the accumulation of a huge amount of data that had to be stored and processed. This turned out to be possible only with computers, and later with supercomputers.

The genomics boom of the last 10-15 years has continued this tradition and multiplied it: biomedical research concerns each of us, which means that more and more data will be produced, especially in light of the idea of personalized medicine and the requirements of big pharma. It is impossible to manage without computing skills and software. But in addition, you need to know well what to study and how, how to analyze the data and how far it can be trusted, how to store and process it, and where to apply and use it.

Most of these "hows" are covered in the lecture. Alla aims to convey the importance and breadth of bioinformatics applications.

2. Mutation process and methods of its study (Alexey Kondrashov, Moscow State University)

[Video] [Slides]

The mutation process is the first of two essential factors in Darwinian evolution. The lecture discusses the causes and mechanisms of mutations, methods for measuring the parameters of the mutation process over short, medium and long time scales, data on mutation rates and the simplest models of the effect of mutation on the genetic structure of a population.

3. Natural selection and methods of its study (Alexey Kondrashov, Moscow State University)

[Video] [Slides]

Natural selection is the second of two necessary factors in Darwinian evolution. The lecture discusses the causes and mechanisms of the emergence of selection, methods and parameters used to describe and study it, data on selection in nature and the simplest models of the influence of selection on the population.

4. Child development and bioinformatics: problems and solutions (Elena Grigorenko, Yale University)

[Video] [Slides]

The lecture describes several “junctions” between the developmental sciences and bioinformatics.
It discusses the problems of prenatal diagnosis and prenatal sequencing, as well as exome sequencing of newborns.

The lecture also covers studies of how the early developmental environment influences the state of the methylome, and the genomic etiology of childhood developmental disorders. In conclusion, it considers ethical issues related to the use of genomic information in making diagnostic and individualized decisions about child development.

5. Next-generation sequencing: principles, opportunities and prospects (Maria Logacheva, Moscow State University)

[Video] [Slides]

Next-generation sequencing (NGS) has transformed many areas of biological and biomedical research. It makes it possible to obtain, relatively quickly and inexpensively, the sequences of genes and genomes of previously unexplored species and, using material from a large number of individuals of the same species, to reveal intraspecific variability and search for genes associated with traits of interest. In addition to determining genome sequences as such, NGS allows detailed analysis of gene expression in different tissues of the body or under different conditions, and is widely used in epigenetic studies.

The lecture provides an overview of the main sequencing methods, their physicochemical principles, features of sample preparation, characteristics of the data obtained, their cost and typical errors. Special attention is paid to the applicability of different methods to biological problems, and recommendations are given for planning experiments that involve NGS.

6. Structural biology of proteins: a review of problems and approaches (Pavel Yakovlev, BIOCAD)

[Video] [Slides]

Primary sequences alone are enough to solve most problems related to nucleic acids (DNA and RNA). When studying the functions of proteins, however, knowing only the primary sequence is no longer enough for most problems. Which proteins will interact with each other, and how strongly? Will replacing an amino acid change the function of the protein? How can the side effects of a therapeutic protein be removed or its effectiveness increased? The field of bioinformatics that develops algorithms for modeling the spatial shape of proteins and their interactions is meant to answer these questions.

7. De novo assembly of transcriptomes (Artem Kasyanov, MIPT)

[Video] [Slides]

As sequencing technologies have become much cheaper and more productive, the number of projects devoted to de novo genome sequencing of non-model organisms has increased significantly. In some cases, de novo sequencing and assembly of the genome is difficult - for example, when the genome is very large. In such cases, researchers turn to the transcriptome. De novo analysis of the transcriptome may also be necessary when studying species with a large number of alternatively spliced genes, since even when a genome is available it is rather difficult to determine the complete list of isoforms.

The lecture is devoted to the assembly of transcriptome data in the absence of a genome. It covers topics such as splice graphs, the Trinity and Newbler programs, comparison and analysis of assemblies, and assembly of the transcriptomes of polyploid organisms.

8. Evolution of genome assembly algorithms (Anton Bankevich, SPbAU RAS)

[Video] [Slides]

To date, there have been several generations of DNA sequencing methods. However, new technologies are meaningless without algorithms that can process their results. Constantly emerging sequencing methods pose new algorithmic challenges. One of the most important of these is genome assembly. The lecture describes the evolution of sequencing methods and of the algorithmic approaches to genome assembly that have arisen, and continue to arise, with each step of this evolution.

9. Introduction to molecular biology and genetics (Pavel Dobrynin, St. Petersburg State University)

[Video] [Slides]

The lecture is devoted to the structure and organization of DNA in prokaryotes and eukaryotes and to the molecular mechanisms responsible for the preservation and reproduction of genetic material. It examines the main mechanisms behind genetic variability and the ways in which genetic material is expressed.

10. The problem of multiple local alignment and construction of synteny blocks (Ilya Minkin, Pennsylvania State University)

[Video] [Slides]

The lecture discusses two similar algorithmic problems in comparative genomics: multiple local alignment and the construction of synteny blocks. These algorithms play a critical role in comparing complete genomic sequences. The lecture describes the problem statements and the basic ideas on which some modern algorithms are built.

11. Why and how to make presentations (Andrey Afanasyev, iBinom)

[Video] [Slides]

The lecture discusses the types of presentations and why they are really needed, and explains how to speak so that the audience understands everything and does not fall asleep, which mistakes to avoid, and whose example to follow when preparing your talk.

12. Business in bioinformatics (Andrey Afanasyev, iBinom)

[Video] [Slides]

The lecture describes which bioinformatics companies exist in Russia and around the world, who created them and what exactly they make money on.
It also discusses the plans of major players and industry trends.

In the final part of the lecture, Andrey gives food for thought about starting your own startup or choosing a new job.

13. Perspectives and problems of systems biology (Ilya Serebriysky, Fox Chase Cancer Center)

[Video] [Slides]

The lecture is intended to give a general idea of the systemic properties of biological objects. Ilya Serebriysky talks about the main components of systems biology, about interactomics and model building, and about the main problems in systems biology and attempts to solve them. Some advances in systems biology (mainly in the field of oncology) are discussed, as are public resources for systems biology (TCGA / cBioPortal, CCLE).

14. Laboratory for Systems Biology (Ilya Serebriysky, Fox Chase Cancer Center)

[Video] [Slides]

This practical session is devoted to building interaction networks based on public databases. It uses databases and web services such as Entrez, GeneMANIA, BioGRID and others. Various methods of visualizing interaction networks are considered, in particular using the Cytoscape program.

15. Metagenomics (Alla Lapidus, SPbAU RAS)

[Video] [Slides]

Microbes are everywhere, microbes rule the world, but we can get to know far from all of them in laboratory conditions. We do not know how to grow the vast majority of them, which means that they must somehow be extracted from their natural habitat - soil, water, the ground under tree roots, etc. - where they live in large communities.

Metagenomics helps in these highly confusing studies. It also helps to feed, warm and heal people and to catch criminals. All of this, along with the role of bioinformatics in metagenomics, is the subject of this lecture.

16. The problem of testing a set of statistical hypotheses (Anton Korobeinikov, St. Petersburg State University, St. Petersburg Academy of Sciences)

[Video] [Slides]

The lecture deals with the classical problem of testing multiple hypotheses simultaneously. Problems of this kind are often encountered, for example, in genome-wide association studies or in the analysis of microarray data. Possible solutions are considered, ranging from the classical Bonferroni approach to methods that control the FDR (false discovery rate).
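To make the two ends of that range concrete, here is a minimal Python sketch of the Bonferroni correction and of the Benjamini-Hochberg step-up procedure, a standard FDR-controlling method. The lecture abstract does not say which FDR methods it covers, so Benjamini-Hochberg is used here as an assumption, and the p-values in the example are made up.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject hypothesis i if p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Made-up p-values for five hypotheses.
p = [0.5, 0.001, 0.045, 0.012, 0.025]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

On this input Bonferroni rejects only the hypothesis with p = 0.001, while Benjamini-Hochberg rejects three, which illustrates why FDR control is preferred when thousands of hypotheses are tested at once.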

17. How to use statistics correctly and incorrectly (Nikita Alekseev, St. Petersburg State University, George Washington University)

[Video] [Slides]

The lecture is devoted to errors in the use of statistics and how to prevent them. In particular, it answers the question: in which situations can standard tests be used to compare typical representatives of a sample, and what should be done if the standard tests do not fit?

18. Mathematical models of gene expression regulation (Maria Samsonova, SPbSPU)

[Video] [Slides]

Understanding the subtle mechanisms that regulate gene activity is a necessary condition for deciphering the mechanisms of disease onset in humans. Unfortunately, there is no such understanding to date: we cannot satisfactorily explain how groups of transcription factors interact with each other, with chromatin proteins, with other adapter proteins and with the RNA polymerase complex, nor how and why a particular part of the DNA sequence can control a complex, spatially limited and time-dependent pattern of gene expression.

Mathematical modeling helps to understand the mechanisms of gene regulation through a mechanistic and quantitative description of this process. The lecture discusses two of the most common approaches to modeling gene expression: one based on nonlinear reaction-diffusion equations and one based on thermodynamic equilibrium. The stages of building such models are considered in turn, and examples of their use for generating new knowledge are given.

19. Semi-local and local sequence alignment (Alexander Tiskin, University of Warwick)

[Video] [Slides]

The calculation of the longest common subsequence (LCS) of two strings is one of the classic algorithmic problems that has wide applications in both computer science and computational biology, where it is known as "global sequence alignment." Many applications need a generalization of this problem, which we call the calculation of semi-local LCS, or "semi-local alignment." In this case, it is required to calculate the LCS between a string and all substrings of another string, and / or between all prefixes of one string and all suffixes of another. In addition to the important role of this generalized problem in string algorithms, it has unexpected connections with semigroup algebra and computational geometry, with comparison networks, and practical applications in computational biology. In addition, the problem of computing semi-local LCS can be used as a flexible and efficient approach to (completely) local alignment of biological sequences.

The lecture presents an efficient solution to the problem of computing the semi-local LCS and gives an overview of the main related results and applications. These include dynamic maintenance of the LCS, fast computation of cliques in some special graphs, fast comparison of compressed strings, and parallel computation on strings.
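To make the terminology concrete, here is a minimal sketch of the classic quadratic dynamic-programming computation of the LCS length, together with a naive version of the "string against all substrings" variant. It is not the efficient algorithm presented in the lecture; the function names are hypothetical and chosen only for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, O(len(a) * len(b)) time, one row of memory.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def string_substring_lcs(a: str, b: str) -> dict:
    """Naive semi-local LCS: a against every non-empty substring b[i:j].

    Recomputes the LCS from scratch for each substring, so it is only
    suitable for very short strings.
    """
    return {
        (i, j): lcs_length(a, b[i:j])
        for i in range(len(b))
        for j in range(i + 1, len(b) + 1)
    }
```

The naive variant repeats a quadratic computation for each of the quadratically many substrings, which is the kind of cost that motivates the specialized semi-local algorithms discussed in the lecture.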

20. Analysis of families of molecular sequences (Sergey Nurk, SPbAU RAS)

[Video] [Slides]

When solving a variety of problems, from searching for regulatory motifs to predicting protein functions, bioinformaticians have to work with entire families of evolutionarily related nucleotide or amino acid sequences. The lecture discusses the different representations of such families used in popular bioinformatics tools and databases. It explains how to read a PROSITE pattern and interpret a sequence logo, how a profile HMM differs from a PSSM, and how to avoid errors when building these models and analyzing the results.
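As an illustration of one of these representations, the sketch below builds a log-odds PSSM (position-specific scoring matrix) from a set of equal-length, already aligned sequences. The uniform background distribution, the pseudocount of 1 and the function names are assumptions made for this example, not details taken from the lecture.

```python
import math
from collections import Counter


def build_pssm(sequences, alphabet="ACGT", pseudocount=1.0):
    """Return a list of columns; each column maps a letter to a log2-odds score.

    Assumes gap-free aligned sequences of equal length and a uniform
    background distribution over the alphabet.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(sequences) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        column = {}
        for letter in alphabet:
            freq = (counts[letter] + pseudocount) / total
            column[letter] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, candidate):
    """Sum of per-position scores for a candidate of the same length as the PSSM."""
    return sum(column[letter] for column, letter in zip(pssm, candidate))
```

A candidate that resembles the family scores above zero, while an unrelated one scores below. Unlike a profile HMM, this matrix has no notion of insertions or deletions.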

21. Epigenomics, RNA and all that (Andrey Mironov, IITP RAS)

[Video] [Slides]

The lecture provides an overview of epigenetics. It considers the levels of structural organization of chromatin and describes various epigenomic modifications: histone modifications and methylation of CpG motifs. Their influence on gene expression is discussed.
The role of epigenomic modifications in splicing, imprinting, etc. is also considered.

The XIST (X-inactive specific transcript) system, antisense RNA, splicing and RNA-dependent regulation are described.
Models for studying epigenomic modifications are also considered.

22. Quality control of NGS data (Konstantin Okonechnikov, Max Planck Institute for Infection Biology)

[Video] [Slides]

The lecture describes the errors typical of NGS technologies. Examples include PCR amplification artifacts, sequence-specific read errors, GC-content bias, and others. Various methods for assessing these errors and accounting for them in the analysis are discussed, along with practical solutions and existing software tools.

23. NGS data quality control, workshop (Konstantin Okonechnikov, Max Planck Institute for Infection Biology)

[Video] [Slides]

During the workshop, participants learned how to apply programming skills to NGS data quality control. The BAM/SAM data formats, the pysam and pyplot libraries, and the fundamental concepts were covered. In particular, the worked examples included calculating GC content, estimating the duplication rate, plotting the insert size distribution, and calculating coverage in windows.
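Two of these metrics can be sketched in plain Python. The workshop itself used pysam to read BAM/SAM files; the example below instead assumes that read sequences and alignment intervals have already been extracted, and its function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the called bases (A, C, G, T); N is ignored."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def windowed_coverage(read_intervals, reference_length, window):
    """Mean per-base coverage in consecutive windows of the reference.

    read_intervals holds 0-based, half-open (start, end) alignment intervals.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (reference_length + 1)
    for start, end in read_intervals:
        start, end = max(0, start), min(reference_length, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1
    per_base, depth = [], 0
    for change in delta[:reference_length]:
        depth += change
        per_base.append(depth)
    means = []
    for begin in range(0, reference_length, window):
        chunk = per_base[begin:begin + window]
        means.append(sum(chunk) / len(chunk))
    return means
```

For example, two reads covering positions 0-3 and 2-5 of an 8-base reference give mean coverage 1.5 in the first 4-base window and 0.5 in the second.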

24. Practical RNA sequencing (Konstantin Okonechnikov, Max Planck Institute for Infection Biology)

[Video] [Slides 1] [Slides 2]

The seminar dealt with the practical task of analyzing RNA sequencing data.
The following methods were discussed and demonstrated through presentation and hands-on practice: read alignment, initial quality control, gene expression analysis pipelines based on DESeq and Cufflinks, detection of transcript isoforms, and the search for fusion genes.

25. Bioinformatic approaches to the study and treatment of cancer, using lung cancer as an example (Maria Shutova, IOGEN RAS)

[Video] [Slides]

Cancer is one of the most common and dangerous diseases. It is called the "disease of the genome" for the enormous contribution of accumulated and new mutations to its emergence and development. It is known that not only the state of the genome, but also the transcriptional and even epigenetic status of primary cancer cells, as well as the complex homeostasis of a growing tumor, directly affect its properties and, most importantly, the susceptibility to therapy. Bioinformatics provides the only way to understand this tangle of interdependent factors. The lecture discusses the main questions related to the study of tumor formation, and possible ways to answer them using bioinformatic approaches.

26. New omics in human biology: metabolomics and lipidomics (Philip Haytovich, Skoltech)

[Video] [Slides]

Sequencing the human genome, studying human genetic variation, sequencing the human metagenome, transcriptome analysis of human tissues: all these biological methods, applied to "big data", have given scientists a wealth of valuable information about what distinguishes humans from other animals.

This lecture is devoted to the new "omics", metabolomics and lipidomics, which make it possible to answer questions about the human body by studying the brain and other tissues.

27. Genomic assembly: a look into tomorrow (Andrey Przhibelsky, SPbAU RAS)

[Video] [Slides]

In recent years, next-generation sequencing technologies have taken a significant step forward: IonTorrent and Pacific Biosciences appeared, and Illumina created a number of new protocols. But, as it turns out, all this is not enough to consider the problem of genome assembly solved. It usually takes dozens of different specialists, hundreds of thousands of dollars, and years of work to go from DNA extraction to a fully completed genome. Therefore, this task remains relevant today from the point of view of both biotechnology and bioinformatics. The lecture discusses the latest breakthroughs in genome assembly methods, the newest data types, which may allow this problem to be taken to a new level, and the prospects for genomics in the near future.

Instead of a conclusion


Comparison of genes within the same or different species can demonstrate similarity in protein function or relationships between species (this is how phylogenetic trees can be constructed). With the growing amount of data, manual analysis of sequences has long been impossible. Nowadays, computer programs are used to search the genomes of thousands of organisms, consisting of billions of base pairs. Programs can match (align) similar DNA sequences in the genomes of different species; such sequences often have similar functions, and the differences arise from small mutations, such as substitutions of individual nucleotides, insertions of nucleotides, and their loss (deletions). One variant of this alignment is used during the sequencing process itself. The so-called shotgun sequencing technique (used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) yields not the complete nucleotide sequence but the sequences of short DNA fragments (each about 600-800 nucleotides long). The ends of the fragments overlap and, properly aligned, give the complete genome. This method quickly yields sequencing results, but assembling the fragments can be quite challenging for large genomes. In the Human Genome Project, the assembly took several months of computer time. Now this method is used for almost all genomes, and genome assembly algorithms are one of the most pressing problems in bioinformatics today.
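The idea of recovering a sequence from overlapping fragments can be illustrated with a toy greedy merger. This is only a sketch under strong assumptions (error-free fragments, all from the same strand, no repeats); real assemblers use far more elaborate methods, and the function names here are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

For the fragments "ATGCGTAC", "GTACGTTA" and "GTTAGC" this returns the single contig "ATGCGTACGTTAGC". Fragments without sufficient overlap are returned as separate contigs.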

Another application of computer sequence analysis is the automatic search for genes and regulatory sequences in the genome. Not all nucleotides in the genome encode proteins. For example, in the genomes of higher organisms, large segments of DNA do not explicitly encode proteins, and their functional role is unknown. The development of algorithms for identifying protein-coding regions of the genome is an important task of modern bioinformatics.

Bioinformatics helps link genomic and proteomic projects, for example, by helping to use DNA sequences to identify proteins.

Annotation of genomes

Biodiversity assessment

Basic bioinformatics programs

ACT (Artemis Comparison Tool) - genomic analysis
Arlequin - analysis of population genetic data
BioEdit
BioNumerics - commercial universal software package
BLAST - search for related sequences in databases of nucleotide and amino acid sequences
Clustal - multiple alignment of nucleotide and amino acid sequences
DnaSP - DNA sequence polymorphism analysis
FigTree - phylogenetic tree editor
Genepop
Genetix - population genetic analysis (the program is available only in French)
JalView - editor for multiple alignments of nucleotide and amino acid sequences
MacClade - commercial software for interactive evolutionary data analysis
MEGA - Molecular Evolutionary Genetics Analysis
Mesquite - comparative biology software written in Java
Muscle - multiple alignment of nucleotide and amino acid sequences. Faster and more accurate than ClustalW
PAUP - phylogenetic analysis using the parsimony method (and other methods)
PHYLIP - phylogenetic software package
Phylo_win - phylogenetic analysis. The program has a graphical interface.
PopGene - analysis of genetic diversity of populations
Populations - population genetic analysis
PSI Protein Classifier - summarizes the results obtained with the PSI-BLAST program
Seaview - phylogenetic analysis (with a graphical interface)
Sequin - submission of sequences to GenBank, EMBL, DDBJ
SPAdes - bacterial genome assembler
T-Coffee - progressive multiple alignment of nucleotide and amino acid sequences. More sensitive than ClustalW / ClustalX.
UGENE - free tool with a Russian-language interface: multiple alignment of nucleotide and amino acid sequences, phylogenetic analysis, annotation, work with databases.
Velvet - genome assembler

Bioinformatics and Computational Biology

Bioinformatics is understood as any use of computers to process biological information. In practice, the definition is sometimes narrower: the use of computers to process experimental data on the structure of biological macromolecules (proteins and nucleic acids) in order to obtain biologically significant information. After the code of the scientific specialty was changed (03.00.28 "Bioinformatics" became 03.01.09 "Mathematical biology, bioinformatics"), the scope of the term "bioinformatics" expanded and now includes all implementations of mathematical algorithms associated with biological objects.

The terms "bioinformatics" and "computational biology" are often used synonymously, although the latter more often refers to the development of algorithms and specific computational methods. It is believed that not every use of computational methods in biology is bioinformatics; for example, mathematical modeling of biological processes is not bioinformatics.

Bioinformatics uses methods from applied mathematics, statistics and computer science. Research in computational biology often overlaps with systems biology. The main efforts of researchers in this area are aimed at studying genomes, analyzing and predicting the structure of proteins, analyzing and predicting the interactions of protein molecules with each other and other molecules, and reconstruction of evolution.

Bioinformatics and its methods are also used in biochemistry, biophysics, ecology and other fields. The common thread of bioinformatics projects is the use of mathematical tools to extract useful information from "noisy" or overly voluminous experimental data on the structure of DNA and proteins.

Structural Bioinformatics

Structural bioinformatics includes the development of algorithms and programs for predicting the spatial structure of proteins. Research topics in structural bioinformatics:

X-ray structural analysis (XRD) of macromolecules
Quality indicators of a model of a macromolecule constructed from XRD data
Algorithms for calculating the surface of a macromolecule
Algorithms for finding the hydrophobic core of a protein molecule
Algorithms for finding the structural domains of proteins
Spatial alignment of protein structures
Structural classifications of domains: SCOP and CATH
Molecular dynamics

Notes


Bioinformatics (biological informatics) is a set of methods and approaches that includes: mathematical methods of computer analysis in comparative genomics (genomic bioinformatics); development of algorithms and programs for predicting the spatial structure of proteins (structural bioinformatics); and research into strategies and the creation of computational methodologies for controlling biological systems.

Bioinformatics uses the methods of applied mathematics, statistics and informatics. It is applied in biochemistry, biophysics, ecology, and other areas of fundamental science. The term appeared in 1970, when Pauline Hogeweg, recognizing the important role of the transmission, storage and processing of information in biological systems, introduced it and defined it as the study of information processes in biotic systems.

Examples of biological information processes studied in the early years of bioinformatics include the emergence of complex structures of social interaction from simple behavioral rules, and the storage and maintenance of information in models of biogenesis and abiogenesis.

At the beginning of the genomic revolution, the term "bioinformatics" was rediscovered and came to mean the creation and maintenance of databases for storing biological information such as nucleotide sequences. Building such databases included building a comprehensive interface that allowed researchers to query existing data and add new data.

The main goal of bioinformatics is to promote understanding of biological processes. Bioinformatics differs from other approaches in that it focuses on creating and applying computationally intensive methods to achieve this goal. Examples of such techniques include pattern recognition, machine learning algorithms, and visualization of biological data. The main efforts of researchers are aimed at solving the problems of sequence alignment, gene finding (locating the DNA regions that encode genes), genome assembly, drug design, drug development, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies, and modeling of evolutionary processes. Bioinformatics today implies the creation and improvement of databases, algorithms, computational and statistical methods, and theories for solving many practical and theoretical problems that arise in the management of biological processes and the analysis of biological data. Thus, modern genetics, evolutionary biology, computational biology and other information-intensive branches of fundamental biology depend on computerization, algorithms, programming and information technology, without which the data processing itself is unthinkable.

Analysis of genetic sequences. Since the genome of the phage Phi-X174 was sequenced in 1977, the DNA sequences of an increasing number of organisms have been determined and stored in databases. These data are used to determine the sequences of proteins and regulatory sites.

Comparison of genes within the same or different species can demonstrate similarity in protein function or relationships between species (this is how phylogenetic trees can be constructed). With the growing amount of data, manual analysis of sequences has long been impossible. Nowadays, computer programs are used to search the genomes of thousands of organisms, consisting of billions of base pairs. Programs can match (align) similar DNA sequences in the genomes of different species; such sequences often have similar functions, and the differences arise from small mutations, such as substitutions of individual nucleotides, insertions of nucleotides, and deletions. One variant of this alignment is used during the sequencing process itself.
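How a program can score the similarity of two sequences in the presence of substitutions, insertions and deletions can be shown with a minimal global alignment sketch in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; production tools use substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for the processed prefix of a against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        curr = [i * gap]
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            curr.append(max(diagonal,           # substitution or match
                            prev[j] + gap,      # ch_a aligned to a gap
                            curr[j - 1] + gap)) # ch_b aligned to a gap
        prev = curr
    return prev[-1]
```

With these scores, two identical 7-letter sequences score 7, a single substitution lowers the score to 5, and a single deleted nucleotide lowers it to 4.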

The shotgun sequencing technique, used by The Institute for Genomic Research to sequence the first bacterial genome, yields not the complete nucleotide sequence but the sequences of short DNA fragments (each about 600-800 nucleotides long). The ends of the fragments overlap and, once aligned, form the complete genome. This method quickly yields sequencing results, but fragment assembly can be very challenging for large genomes. In the Human Genome Project, the assembly took several months of computer time. Now this method is applied to almost all genomes, and genome assembly algorithms are one of the most pressing problems in bioinformatics today.

Another example of computer analysis of genetic sequences is the automatic search for genes and regulatory sequences in the genome. Not all nucleotides in the genome encode proteins. In the genomes of higher organisms, large segments of DNA do not encode proteins, and their functional role is unknown. The development of algorithms for identifying protein-coding regions of the genome is an important task of modern bioinformatics.
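The simplest approach to locating candidate protein-coding regions is to scan for open reading frames (ORFs): stretches that begin with a start codon and run in-frame to a stop codon. The sketch below assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, and it scans only the forward strand. Real gene finders also handle the reverse strand, introns and statistical signals, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return sorted (start, end) half-open intervals of forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive);
    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)
```

For "CCATGAAATTTTAGGG" the function reports one ORF, at positions 2 to 14, which corresponds to "ATGAAATTTTAG".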

Bioinformatics helps link genomic and proteomic projects by helping to identify proteins in a DNA sequence.

Annotation of genomes. In genomics, annotation is the process of marking genes and other features in a DNA sequence. The first genome annotation software system was created in 1995 by Owen White, who worked in the team that sequenced and analyzed the first genome of a free-living organism, the bacterium Haemophilus influenzae. Dr. White built a system for finding genes, RNA and other DNA features, and made the first assignments of function to these genes. Most modern systems work in a similar way, and these programs are constantly evolving and improving.

Computational evolutionary biology. Evolutionary biology examines the origin and emergence of species, as well as their development over time. Biological informatics helps evolutionary biologists and geneticists in several ways:

Study the evolution of a large number of organisms living on Earth by measuring changes in their DNA;

Compare entire genomes, which allows the study of complex evolutionary events in Earth's biological history: gene duplication, lateral gene transfer, and factors important in bacterial speciation;

Build computer models of biological populations to study how a biological system develops over time;

Track published information on the evolution of a large number of species.

The field of computer science that uses genetic algorithms to solve problems is also associated with computational evolutionary biology. Work in this area uses specialized software to improve algorithms and computations. The research is based on evolutionary methods and principles such as replication, diversification, recombination, mutation, and survival under natural selection.
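The principles listed above map directly onto the steps of a genetic algorithm. The following minimal sketch evolves bit strings toward higher fitness; the population size, mutation rate, truncation selection and elitism are all assumptions chosen for the example rather than a description of any particular system.

```python
import random


def evolve(fitness, genome_length=20, population_size=30,
           generations=40, mutation_rate=0.02, seed=0):
    """Evolve bit-string genomes; return the best genome and the fitness history."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        survivors = ranked[:population_size // 2]   # selection
        children = [ranked[0][:]]                   # elitism: replicate the best
        while len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)   # one-point recombination
            child = mother[:cut] + father[cut:]
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]             # mutation
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, best_per_generation
```

Calling `evolve(sum)` maximizes the number of ones in the genome. Because the best individual is always copied unchanged into the next generation, the recorded best fitness never decreases.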

Assessment of biological diversity. The biological diversity of an ecosystem can be defined as the complete genetic totality of a certain environment, consisting of all living species, be it a biofilm in an abandoned mine, a drop of sea water, a handful of earth, or the entire biosphere of planet Earth.

Databases are used to collect species names, descriptions, distribution areas, and genetic information. Specialized software is used to search, visualize and analyze this information. Computer simulations model population dynamics and calculate the overall genetic health of a crop in agronomy.

One of the most important prospects of this field lies in the analysis of DNA sequences or complete genomes of endangered species, which makes it possible to preserve the results of nature's genetic experiment in a computer so that they can be used again in the future, even if these species disappear completely.

Methods for assessing other components of biodiversity, taxa (primarily species) and ecosystems, often fall outside the scope of bioinformatics. Currently, the mathematical foundations of bioinformatic methods for taxa are presented within the scientific field of phenetics, or numerical taxonomy. Methods for analyzing the structure of ecosystems are considered by specialists in such areas as systems ecology and biocenometry.

Bioinformatics and Computational Biology. In the broad sense, bioinformatics is understood as any use of computers to process biological information. In the narrower sense, it is the use of computers to process experimental data on the structure of biological macromolecules (proteins and nucleic acids) in order to obtain biologically significant information. The terms "bioinformatics" and "computational biology" are often used interchangeably, although the latter refers more to the design of algorithms and specific computational techniques. The use of computational methods in biology is also associated with mathematical modeling of biological processes.

Bioinformatics uses methods from applied mathematics, statistics and computer science. Research in computational biology overlaps with systems biology. The main efforts of researchers are aimed at studying genomes, analyzing and predicting the structure of proteins, and analyzing the interactions of protein molecules with each other and with other molecules, which is necessary for the reconstruction of evolutionary processes.
