|
The AST/TA approach was part of the research at Humboldt-University Berlin into better storage models for XML databases. It extracts the XML markup from an XML document and stores it in an optimized tree structure, the "Abstract Syntax Tree", while the remaining pure text content is stored linearly on disk as the "Text Array". On gigabyte-scale mass storage media this storage model allows for a highly optimized variant of an XML database. I was a member of the team at the chair of Prof. Freytag that ran the project, and I subsequently wrote a study thesis about the implementation of the textual pTA. The research on AST/TA aimed to show that this representation of hierarchical semistructured data allows for fast navigation and, at the same time, for ultrafast restructuring of the markup, plus the ability to store multiple access trees over the same text content. To speed up access further, the team developed optimizations in the "XML Query Execution Engine" (XEE). The experience with that model made me think of applying it to something quite different. I created my own offspring project with an AST/TA implementation for main memory only and tested it in a documentation processor. Here I took advantage of the linear text array, which can be scanned with traditional parsing engines (regular expressions are mostly sufficient), with any structured data that is recognized being represented as tree nodes that reference back to the text content. This makes it possible to store all the intermediate steps of the documentation creation in a single data tree that can be saved at any point, possibly transferred to a third-party XML tool, and subsequently restored for further processing in the engine.
The implementation parts were wrapped up in the XML/g project, which was used to produce the documentation of the PFE project. That documentation covered not only C/C++ symbols but also embedded Forth symbols that needed to be recognized in the middle of the standard syntax. Despite the number of files and the thousands of symbols, execution was faster than comparable parsing and XML transformation with contemporary XSLT/DocBook tools. I was hoping to show added advantages stemming from the ability to chain multiple XML transformation steps (processing pipelines), especially in the field of bioinformatics, but in the end time ran out. The academic paper with a performance comparison is still unfinished, but its parts have nevertheless documented the potential of XML-based documentation pipelines and their advantages and disadvantages compared with other approaches. |
|
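The main-memory AST/TA idea described above can be illustrated with a small Python sketch. It is not the XML/g code: the names `Node`, `scan` and `to_xml`, the sample text and the regular expressions are all assumptions made for illustration. The text array is a plain string, tree nodes hold only offsets into it, a regular expression scan adds nodes, and two independent trees are built over the same text.

```python
import re
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Node:
    """Hypothetical tree node: holds no text, only offsets into the text array."""
    tag: str
    start: int
    end: int
    children: list = field(default_factory=list)

    def text(self, ta: str) -> str:
        return ta[self.start:self.end]


def scan(ta: str, parent: Node, tag: str, pattern: str) -> Node:
    """Scan the parent's span with a regex and add one child node per match."""
    for m in re.finditer(pattern, ta[parent.start:parent.end]):
        parent.children.append(
            Node(tag, parent.start + m.start(), parent.start + m.end()))
    return parent


def to_xml(node: Node, ta: str) -> str:
    """Serialize a tree plus the text array back into XML markup."""
    parts, pos = [], node.start
    for child in node.children:  # children are in text order and do not overlap
        parts.append(escape(ta[pos:child.start]))
        parts.append(to_xml(child, ta))
        pos = child.end
    parts.append(escape(ta[pos:node.end]))
    return f"<{node.tag}>{''.join(parts)}</{node.tag}>"


# The text array: pure text content, stored once (sample input is made up).
ta = "int add(int a); int sub(int b);"

# First access tree: function symbols, found as identifiers followed by "(".
symbols = scan(ta, Node("doc", 0, len(ta)), "symbol", r"\w+(?=\()")

# Second access tree over the same text: every occurrence of the type name.
types = scan(ta, Node("doc", 0, len(ta)), "type", r"\bint\b")
```

Calling `to_xml(symbols, ta)` produces a document that an external XML tool could consume. Since nodes store only offsets, restructuring the markup or adding a further tree never copies or modifies the text itself.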