Construction of an ontology from a knowledge corpus for a microbial cell production and stabilization system - Internship paper (SELECTED PARTS)


Summary

The overall goal of my mission is to explore the different methods and establish an approach for constructing an ontology from a given knowledge corpus. The domain under study is a topic drawn from Biology: "Microbial cell production and stabilization system". In other words, the general task of my internship consists of defining a workflow and an ontological model able to represent the knowledge contained in a document corpus.
This project arose from the need to reinforce the scientific field of Biotechnology and current research on bioproducts and their transformation processes. The context of my mission is therefore to provide conceptualizations able to describe the complexity of the system under study, and the importance of the project lies in analyzing and reducing food-related risks.
Concerning my mission, it is essential to evaluate and intelligently adapt the various approaches and methods related to my goals. Another challenge is that the studied domain is vast, complex and entirely new to me. Furthermore, processing the knowledge corpus is a demanding task: it can be considered a preparation stage performed in several steps, on which the ontology is largely based. Nonetheless, the knowledge contained in this corpus is helpful for the ontology construction process, as it provides information about the terminology and the numerous concepts of the application domain. Briefly, this is the framework of my research and work.
To conclude, for each task I worked with different tools, for which I provide a description and an evaluation of their functionalities, advantages and disadvantages. To process the knowledge corpus, which consisted of nearly 160 PDF documents, and to automate the procedure, I also developed a Python program. From the knowledge corpus I obtained a large text corpus, from which a list of approximately 12,000 terms was extracted. Based on that list and other relevant information, I proposed different versions of ontological models, containing in total nearly 600 concepts, individuals, relations and definitions.
In this report, I will first introduce the project and my mission, and present my working environment as well as the hosting institutes. Afterwards, I will focus on the technical details of each task of my mission, presented in logical order. Last but not least comes the evaluation part, in which I give a critical review of both my performance and the project in general.

Chapter 1.

The objectives of my mission can be summarized in the following general tasks:

  • Evaluate and select the most suitable method, language and software, in order to construct an ontology from a given knowledge corpus.
  • At the same time and as a long-term objective, I have to automate the overall chain of treatment that I will follow.
  • Process the knowledge corpus. Extract and analyze its text and find a way to exploit the information contained in it.
  • Develop an ontology. Provide an explicit vocabulary for the various concepts of the application domain.

    Some basic assumptions made prior to my mission are:

  • The available knowledge, once modeled, can effectively contribute to reducing the uncertainty factors involved in the transformation processes of the system under study.
  • A semantic, ontological framework is an optimal way to organize information and integrate knowledge into computer systems.
  • A knowledge corpus-based approach will enrich the ontology with linguistic and statistical information.

    The challenges of my mission are numerous. Some of them existed before my involvement in the project, while others I came across during my six-month mission. Most of them concern difficult-to-predict variables involved in the chosen approach. They can be summarized as follows:

  • A challenging task is that of evaluating and exploiting the knowledge corpus. Considered as a text resource, the knowledge corpus contains a plethora of both useful and useless information, which is hard to tell apart.
  • Providing a conceptualization for a domain as vast and complex as the one in my case is a rather difficult task.
  • I also have to answer the question of how to evaluate and choose among the various approaches, languages, tools and software packages for constructing such a model.
  • Concerning the ontology construction, it requires effort to remain consistent and to represent knowledge from a certain perspective, while keeping in mind the re-use (alignment and evolution) of the ontology.

    Chapter 3.

    First attempt towards solving the problem

    The workflow of my mission and the tasks involved are presented in this part of the report, in logical order. In reality, however, working on such a project requires a more complex schedule and multi-tasking. A working strategy suited to the project is to iterate multiple times over one or more related tasks, creating project layers in this way; at each step of the procedure, the solution with the best possible outcome is selected. This method is called "iterative and incremental development". In practice, the overall procedure was repeated many times.
    To begin with, I selected just a few files of the knowledge corpus and started working with them. They formed a sample directory of the knowledge corpus. From it, I extracted text and created a small text corpus, which I then processed in order to acquire a list of terms. With this list I was able to build a first ontology example. The figure below illustrates the result of this process.
    The software used for this task is the NeOn Toolkit [5], an ontology engineering environment that provides comprehensive support for the ontology engineering life-cycle. It is based on the Eclipse platform, a leading development environment, and provides an extensive set of plugins covering a variety of activities, including annotation and documentation.

    Text mining

    The initial task of my mission is text mining. In this phase I have to process the knowledge corpus. More precisely, I need to extract the text contained in a directory of approximately 160 PDF files. These files are the collection of documents on which the ontology is going to be based. Implementing a short-term testing strategy, I created a small sub-directory of the knowledge corpus. Initially, this allowed me to save time while trying to find the best way to perform the procedure. For this I used PDFMiner [6], a Python library and tool developed for extracting information from PDF files, based on textual analysis and parsing techniques.
    PDFMiner [7] was run under a virtualized Ubuntu 12.04 operating system. After installing the software, along with a version of Python, I used a one-line command to extract text into a new text file.

    pdf2txt.py -o filename.txt directory/filename.pdf

    In the above command, "pdf2txt.py" is the name of the module responsible for the extraction process. The "-o" option specifies the output filename, while directory/filename.pdf gives the path of the PDF file to be processed [8].
    At first, the testing directory allowed me to launch the procedure multiple times manually, simply by changing the names of the input and output files. Later, I would have to implement a global and automated solution for this task, making the procedure shorter and repeatable.
    Another candidate for this task is ABBYY PDF Transformer [9], a commercial tool that converts PDF files into an editable and searchable text format. This software is based on an optical character recognition (OCR) system.
    The main reason for choosing PDFMiner over ABBYY PDF Transformer is that PDFMiner, as a Python library, gives the user a certain flexibility advantage. This means that I can modify the analysis parameters according to my needs or even enhance the procedure by adding functionality. The importance of this advantage is demonstrated thoroughly in the next chapter of my report.
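
Although in this project the extraction was driven through the pdf2txt.py command, the same library can also be called directly from Python. The snippet below is only a minimal sketch of that flexibility, assuming the current pdfminer.six fork of the library; the LAParams values are illustrative, not the settings used in the project.

# Minimal sketch: library-level extraction with pdfminer.six (assumed fork).
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# Layout-analysis parameters can be tuned, e.g. to better handle two-column pages.
laparams = LAParams(line_margin=0.5,   # vertical closeness for grouping lines into boxes
                    boxes_flow=0.5)    # weighting of vertical vs. horizontal order of boxes

text = extract_text("directory/filename.pdf", laparams=laparams)
with open("filename.txt", "w", encoding="utf-8") as out:
    out.write(text)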

    Python programming

    As described above, once the task of PDF mining is established and the overall chain of treatment has been followed, the next task is to create a text corpus. This task also includes improving the extracted text: the output data have to be re-processed, this time in order to be 'cleaned'.
    During the extraction process, by default, the contents of each document are extracted to a new text file. What I need, however, is a single text corpus containing all of these documents. The problems that occurred during the extraction process and have to be solved are:

  • Text layout: most of the PDF documents have a two-column layout, which poses a problem for the extraction procedure. As a result, many words found at the end of a column are split in two parts. Moreover, some words ending in “-ing” or “-ion” are hyphenated in the same way: their suffix is sent to the next line and is later recognized as a term. Such problems degrade the extraction results, either by truncating terms or by malforming the extracted contents.

  • Special characters and encoding: some characters, most often outside the text body, such as names, references, non-English characters or special characters (mathematical symbols, etc.), are misrepresented during the extraction process, compromising the validity of the output text file. Similarly, some of the files pose problems because of a different encoding.


An important point is that the above problems are related to the nature of the PDF format: “A PDF 'document' is nothing like a Word or HTML document. PDF is more like a graphical representation. PDF contents are just a bunch of instructions that tell how to place the content at each exact position on a display or on paper. In most cases, it has no logical structure such as sentences or paragraphs, and it cannot adapt itself when, for example, the paper size changes.”


fig. special chars table


In order to solve the aforementioned issues and join the text files into the text corpus, I wrote a verbose Python program [11] [12] [annexes: 1. Python program, p. 40-43] that includes the following functions:

  • Run through a directory containing the knowledge corpus.

  • Call PDFMiner's pdf2txt.py module to process these PDF files and extract their contents.

  • Concatenate the extracted text into a single text file, creating a text corpus.

  • Re-write the text corpus into a new text file, in order to clean it from malformed and misrecognised characters.

  • As an optional step under Ubuntu, the script can call TreeTagger through the shell to process the text corpus.


The first approach to solving the special-characters problem was to write a function that simply lists these characters and replaces them. However, this character-exclusion approach was unsuccessful: given the amount of data, some problematic characters are hard to detect when there is no advance indication of what they will look like. A sketch of the resulting script is given below.
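
The following is a minimal sketch of that script, assuming that pdf2txt.py is available on the PATH; the file names, the simple cleaning rule and the commented-out TreeTagger call are placeholders rather than the exact code listed in the annexes.

import os
import subprocess

CORPUS_DIR = "knowledge_corpus"          # directory holding the ~160 PDF files
RAW_CORPUS = "text_corpus_raw.txt"       # concatenation of the extracted text
CLEAN_CORPUS = "text_corpus_clean.txt"   # cleaned version of the corpus

# 1-2. Run through the directory, extract each PDF with pdf2txt.py and concatenate the results.
with open(RAW_CORPUS, "w", encoding="utf-8") as corpus:
    for name in sorted(os.listdir(CORPUS_DIR)):
        if not name.lower().endswith(".pdf"):
            continue
        txt_name = name[:-4] + ".txt"
        subprocess.call(["pdf2txt.py", "-o", txt_name, os.path.join(CORPUS_DIR, name)])
        with open(txt_name, encoding="utf-8", errors="replace") as extracted:
            corpus.write(extracted.read())
            corpus.write("\n")

# 3. Re-write the corpus into a new file, keeping only printable characters
#    (a simplistic cleaning rule; the real program applies more elaborate replacements).
with open(RAW_CORPUS, encoding="utf-8") as raw, open(CLEAN_CORPUS, "w", encoding="utf-8") as clean:
    for line in raw:
        clean.write("".join(ch for ch in line if ch.isprintable() or ch in " \t\n"))

# 4. Optional (Ubuntu): call TreeTagger on the cleaned corpus through the shell.
# subprocess.call("cmd/tree-tagger -sgml -lemma english-utf8 < %s > text_corpus.tt"
#                 % os.path.abspath(CLEAN_CORPUS), shell=True, cwd="/path-to/Treetagger")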

Natural language processing

In this part of the project, the main objective is to obtain a list of terms. This term list will be extracted from the text corpus constructed in the previous step of my mission. The term list can be considered as the terminology of the knowledge corpus domain and, combined with other pertinent information, it will serve as an index and a source of knowledge for the subsequent task of ontology construction. In order to acquire that list, the text corpus first has to be annotated. Other objectives of this task are:

  • Annotation of the text corpus (lexical-category disambiguation).
  • Statistical information concerning the terms (frequency, specificity, weight).
  • Obtain each term's context for every occurrence inside the text corpus.
  • Evaluate the resulting data, i.e. the terms and their significance.

In this part of my report I describe this task thoroughly. I also focus on the tools used and the theory behind them, and I try to explain how they work.

POS tagging
TreeTagger [13] is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. It can successfully handle more than 15 languages and can be trained to adapt to other languages as well.

TreeTagger implements a method [14] in which transition probabilities are estimated using decision trees in order to perform part-of-speech (POS) tagging. In linguistics, part-of-speech tagging is the process of annotating each word in a text with its lexical category, based on both its definition and its context. In TreeTagger, the disambiguation of a word is driven by how predictable its POS is from the context in which it is used. More precisely, most methods for automatic part-of-speech tagging are based on probabilities and Markov models (MMs) [15]. In such methods, the tags represent states, the transitions between states are probabilistic, and only the observation sequence (the words) is known. Given that, the required likelihood is computed either by summing over all possible state sequences or by computing the most likely state sequence.
However, those methods have difficulties with sparse data: many frequencies are too small for the corresponding probabilities to be estimated reliably. To avoid this issue, TreeTagger uses decision trees to obtain reliable estimates of the transition probabilities. The decision tree automatically determines the appropriate size of the context used to estimate the transition probabilities. Decision trees are built recursively from a training set of trigrams, a trigram being the special case of an n-gram (a contiguous sequence of n items from a text or speech sequence) with n = 3. An example decision tree is illustrated below.


 fig. Sample decision tree

In contrast with a simple n-gram tagger, TreeTagger estimates the transition probabilities with a binary decision tree. The probability of a given trigram is determined by following the corresponding path through the tree until a leaf is reached.

If, for example, we look for the probability of a noun preceded by a determiner and an adjective, p(NN | DET, ADJ), we first have to answer the test at the root node. Since the tag of the previous word is ADJ, we follow the 'yes' branch. The next test (tag-2 = DET) is true as well and we end up at a leaf node. Now we simply look up the probability of the tag NN in the table attached to that leaf.
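
To make this lookup concrete, the toy sketch below hard-codes a tiny decision tree and walks it exactly as described; the tree structure and the probabilities are invented for illustration and are not taken from TreeTagger.

# Each internal node tests one of the two preceding tags; each leaf stores a probability table.
toy_tree = {
    "test": ("tag-1", "ADJ"),                       # root: is the previous tag ADJ?
    "yes": {
        "test": ("tag-2", "DET"),                   # is the tag before that a DET?
        "yes": {"leaf": {"NN": 0.70, "JJ": 0.10, "VBG": 0.05}},
        "no":  {"leaf": {"NN": 0.55, "JJ": 0.20}},
    },
    "no": {"leaf": {"NN": 0.25, "VB": 0.30}},
}

def transition_prob(tree, tag, context):
    """Follow the decision tree for a context such as {'tag-1': 'ADJ', 'tag-2': 'DET'}."""
    node = tree
    while "leaf" not in node:
        feature, expected = node["test"]
        node = node["yes"] if context[feature] == expected else node["no"]
    return node["leaf"].get(tag, 0.0)

# p(NN | DET, ADJ): the previous tag is ADJ, the one before it is DET.
print(transition_prob(toy_tree, "NN", {"tag-1": "ADJ", "tag-2": "DET"}))   # -> 0.7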


In my case, TreeTagger was used from the Ubuntu Linux terminal. The basic command line requires at least one argument, the parameter file. The command used to tag my text corpus is:


/path-to/Treetagger/cmd/tree-tagger -sgml -lemma english-utf8 < input_file.txt > output_file.tt

Here, `english-utf8` refers to my parameter file, since the text corpus is in English and encoded in UTF-8. The `-sgml` and `-lemma` options are used to ignore tokens enclosed in “< >” and to print the lemmas, respectively.
Later on, the above command line was integrated into my Python program. The Python program automatically calls the shell, changes to the TreeTagger directory and passes the command with its parameters. This of course facilitates the procedure, making it a lot faster. For future use, it is simply necessary to modify the `/path-to/Treetagger` path and, if needed, the command-line arguments.
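
A minimal sketch of that call is given below; the paths are placeholders and the command line is the one discussed above.

import os
import subprocess

TREETAGGER_DIR = "/path-to/Treetagger"                    # placeholder installation path
INPUT_FILE = os.path.abspath("text_corpus_clean.txt")     # cleaned text corpus
OUTPUT_FILE = os.path.abspath("text_corpus.tt")           # tagged output, later fed to YaTeA

# shell=True keeps the input/output redirections of the original command line;
# cwd switches to the TreeTagger directory before running it.
command = "cmd/tree-tagger -sgml -lemma english-utf8 < %s > %s" % (INPUT_FILE, OUTPUT_FILE)
subprocess.call(command, shell=True, cwd=TREETAGGER_DIR)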

Term extraction and YaTeA (Yet Another Term Extractor)

YaTeA [18][19] is a Perl script for extracting terms from a corpus of texts and providing their syntactic analysis in a head-modifier representation. It aims at extracting noun phrases that look like terms inside a corpus. The input file must be encoded in UTF-8 and tagged with TreeTagger beforehand. As output, the script generates a directory containing the results in various formats, depending on the configuration: text, XML and a TreeTagger-like file format.

Term extraction strategy:


As an input, the term extractor requires a corpus which has been segmented into words and sentences, lemmatized and tagged with part-of-speech (POS) information. Analysis of the term candidates is based on the exploitation of simple parsing patterns and endogenous disambiguation. Exogenous disambiguation is also made possible for the identification and the analysis of term candidates by the use of external resources, i.e. lists of testified terms to assist the chunking, parsing and extraction steps [20].

Endogenous and exogenous disambiguation:

Endogenous disambiguation consists in the exploitation of intermediate extraction results for the parsing of a given Maximal Noun Phrase (MNP).
All the MNPs corresponding to parsing patterns are parsed first. Unparsed MNPs are processed using the MNPs parsed during the first step as islands of reliability. An island of reliability is a subsequence (contiguous or not) of a MNP that corresponds to a shorter term candidate in either its inflected or lemmatized form. It is used as an anchor as follows: the subsequence covered by the island is reduced to the word found to be the syntactic head of the island. Parsing patterns are then applied to the simplified MNP. This feature allows the parse of complex noun phrases using a limited number of simple parsing patterns. In addition, islands increase the degree of reliability of the parse as shown in Figure x.






Fig. Effect of an island on parsing using MNPs



During chunking, sequences of words corresponding to testified terms are identified. They cannot be further split or deleted. Their POS tags and lemmas can be corrected according to those associated to the testified term.

If an MNP corresponds to a testified term for which a parse exists (provided by the user or computed using parsing patterns), it is recorded as a term candidate with the highest score of reliability. Similarly to endogenous disambiguation, subsequences of MNPs corresponding to testified terms are used as islands of reliability in order to augment the number and quality of parsed MNPs [20].


Term candidate extraction process:

  • Chunking: the corpus is chunked into MNPs.
  • Parsing: for each identified MNP type, except mono-lexical MNPs, different parsing methods are applied in decreasing order of reliability.
    • tt-covered: the MNP inflected or lemmatized form corresponds to one or several combined testified terms (TT);
    • pattern-covered: the POS sequence of the (possibly simplified) MNP corresponds to a parsing pattern provided by user;
    • progressive: the MNP is progressively reduced at its left and right ends by the application of parsing patterns. Islands of reliability from term candidates or testified terms are also used to reduce the MNP sequence of the MNP to allow the application of parsing patterns.
  • Extraction of term candidates: statistical measures are applied, reflecting the likelihood of an MNP being a term.

The term extraction process is launched by passing the following example command in a terminal.


yatea -rcfile etc/yatea/yatea.rc share/YaTeA/samples/sampleEN.ttg


In the above command, sampleEN.ttg is the tagged version of the text corpus to be processed (in English), while yatea.rc is the configuration file that controls the analysis.
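
As a complement to YaTeA's own measures, the sketch below shows how basic frequency information (one of the statistics mentioned earlier) can be derived from the tagged corpus. It assumes the usual TreeTagger output format of one token per line, “word TAB part-of-speech TAB lemma”, and is an illustration rather than part of the actual processing chain.

from collections import Counter

def noun_lemma_frequencies(ttg_path):
    """Count how often each noun lemma occurs in a TreeTagger-tagged (.ttg/.tt) file."""
    counts = Counter()
    with open(ttg_path, encoding="utf-8") as tagged:
        for line in tagged:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue                     # skip SGML tags or malformed lines
            word, pos, lemma = fields
            if pos.startswith("NN"):         # common nouns (NN, NNS) in the English tagset
                counts[lemma.lower()] += 1
    return counts

# Example: print the 20 most frequent noun lemmas of the corpus.
for lemma, freq in noun_lemma_frequencies("output_file.tt").most_common(20):
    print(lemma, freq)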

The Ontology


As described before, ontology development is an iterative procedure, and during this task I produced several ontological models. Initially, I go through the term list and, for each term, decide whether it is an important concept of the domain and should therefore be registered as a class of the ontology, or whether it is a sub-class, a relationship, or an instance of a class.

To do so, I mainly use the occurrence context of each term, as well as resources on the Internet, such as:

  • Wikipedia the free Encyclopedia [53]
  • Biology online [54], TheFreeDictionary [55]
  • Ontologies mentioned in the state of the art part of my project


By combining all the above resources, I started constructing an ontology. The first domain ontology I produced counts around 300 classes, sub-classes and object properties (i.e. the relations among those classes), plus another 300 instances.


 fig. part of the ontology



After the meeting mentioned above, my “focus areas” changed and my goal became more precise. With the previous ontology as a basis, I started restructuring it, this time including only the upper-level, basic components of the system. In this new structuring approach, I did not go deep into the hierarchy of the ontology. Two significant differences between the two approaches are:

  • In the latest version, I removed many intermediate classes. For example, instead of adding all the different protein types under a “Protein type” sub-class of the class “Protein”, I inserted those protein types directly under the class “Protein”.
  • Another difference is that, while the first ontology relies heavily on “instances”, in the second one there has been an effort to avoid instances and instead use a more concept-relation oriented approach to describe things.


Below, I describe and explain the restructuring process that took place. To do so, I provide two example images that concern the same domain concepts. The first image (figure xv.) is from my first work and shows only a small part of the ontology. The second image (figure xvi.) is a small part of the latest ontology construction, narrower and more precise.




Figure xv. Part of the first ontology: Concept of Biomass




Figure xvi. Part of the second ontology: Biomass concept



In figure xvi., “Biomass” is a class that contains the “Micro organism” sub-class, which in turn contains only “Bacteria” and “Yeast”. In figure xv., as we can see, there are other sub-classes of “Biomass” as well, which, for the time being, we decided not to include in the new version of the ontology.

In figure xvi., in comparison with figure xv., the “Yeast” sub-class contains three different yeast species: “Candida Albicans”, “Saccharomyces Cerevisiae” and “Yarrowia Lipolytica”. Additionally, there are three registered “Strain” entries for “Yarrowia Lipolytica”, which can be either “Mutant” or “Wild type” strains.

The two differences discussed earlier can be demonstrated by looking at the concepts of “Bacteria” and “Yeast”. Figure xv. shows sub-classes and instances of “Bacteria”, while in figure xvi. there are none. Similarly, there are four intermediate sub-classes of “Yeast”, and a “Yeast species” class is used for the three registered species.
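
For illustration only, the hierarchy of figure xvi. could be transcribed programmatically as in the sketch below, which uses the owlready2 Python library; the project itself was developed with the NeOn Toolkit, so the class names are taken from the figure and the IRI is a placeholder.

from owlready2 import Thing, get_ontology

onto = get_ontology("http://example.org/microbial-cell.owl")   # placeholder IRI

with onto:
    class Biomass(Thing): pass
    class Micro_organism(Biomass): pass
    class Bacteria(Micro_organism): pass
    class Yeast(Micro_organism): pass

    # The three registered yeast species, placed directly under Yeast.
    class Candida_Albicans(Yeast): pass
    class Saccharomyces_Cerevisiae(Yeast): pass
    class Yarrowia_Lipolytica(Yeast): pass

    # Strains of Yarrowia Lipolytica, either mutant or wild type.
    class Strain(Thing): pass
    class Mutant(Strain): pass
    class Wild_type(Strain): pass

onto.save(file="microbial-cell.owl", format="rdfxml")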

Furthermore, some of the basic concepts included in the latest ontology are shown in the picture below.



Figure xvii. Main concepts included in the latest ontology.



In the above figure, the plus symbol in the upper-left corner of a class means that it contains sub-classes and can be expanded further.

In this ontology I included the basic cell parts (Cytoplasm, Golgi complex, Wall, etc.). At the molecular level there are classes such as “Gene”, “Protein” and “Metabolic pathway”, the latter containing three examples of different pathways (“MAP-kinase”, “Calcineurin” and “Proteolytic”). I also included the seven transformation processes of a product, as described in the representation of the system, and the “Stress” class with different types of environmental stress or stress factors.

What is important here is to model and describe the interactions between these various concepts. Take for example the sentence: “When some kind of stress is applied to the cell, an adaptive response is triggered and a metabolic pathway is activated”. This sentence describes a process of the system; to model it, the classes “Stress”, “Cell” and “Metabolic pathway” are used, and “adaptive response”, “applied” and “activated” serve as object properties, i.e. relations among them. In comparison with the first version of the ontology, this one contains fewer concepts and individuals but can be considered well defined and coherent. Moreover, many textual definitions are given for these concepts. Finally, it is important to note that the ontology is still under development.
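
As an illustration of how such a sentence can be encoded, the sketch below declares the corresponding classes and object properties with the owlready2 Python library; the property names follow the sentence rather than the actual ontology, and the IRI is a placeholder.

from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/microbial-cell.owl")   # placeholder IRI

with onto:
    class Cell(Thing): pass
    class Stress(Thing): pass
    class Adaptive_response(Thing): pass
    class Metabolic_pathway(Thing): pass

    class applied_to(ObjectProperty):        # "stress is applied to the cell"
        domain = [Stress]
        range = [Cell]

    class triggers(ObjectProperty):          # "an adaptive response is triggered"
        domain = [Stress]
        range = [Adaptive_response]

    class activates(ObjectProperty):         # "a metabolic pathway is activated"
        domain = [Adaptive_response]
        range = [Metabolic_pathway]

onto.save(file="microbial-cell.owl", format="rdfxml")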
 

