Bouwers > Eric > Blog: PHP-Front

Showing posts with label PHP-Front. Show all posts

Migration to the nine-headed monster

After some recent activity on the psat-dev mailinglist I became aware of the (lack of) available builds for php-sat. Even though the development speed is not what I would want it to be (so much fun things to do, so little hours in a day!) I still believe it is important to release early and often.

Fortunately, Eelco Dolstra had some time to migrate php-front, php-sat and php-tools to Hydra, the new Nix-based continuous build system. After some tweaking we now again have access to unstable build for all PSAT-projects. Go Hydra!

Visualizing the PHP grammar

A few weeks ago an e-mail from Didier Garcin popped up on the Stratego mailing list. He explained that he had written a python script that could visualize an abstract syntax signature. The script was also send to the list and I finally had some time to check it out.

It turned out that I had almost anything installed to use the script, only the pydot dependency needed some work. This was mostly because the script only seems to work with the 0.9.10 version of this library. After getting the script started it was really simple to generate the signature from the PHP-Front grammar.

So, here is the one for PHP4:

And the one for PHP5:

The first thing I notices where the big rectangles in both versions. These rectangles are the statements (smaller one) and expressions grouped together. What I also notices was that in both versions we see that the bottom of the graph (which corresponds with the smallest units in the language) looks the most complicated. This corresponds very well with the amount of effort put into the modeling of this part of the language.

Comparing both grammars to each other we can see that the latest versions is the most complicated one. Furthermore, if we compare both graphs to the graphs shown in this post we can see that the Java-graph appears to be the most similar one.

Actually, I do not think these images show anything, but please explain it to me when you think I just don't see it. Anyway, at least we have some nice pictures now :)

P.S. for those who are interested, more detailed images (in svg-format) are available here.

A little migration helper

You probably already heard about it, but support for PHP version 4 is partly dropped as of 31-12-2007. And after 08-08-2008 the support ends completely.

Luckily, the PHP documentation team provides you with a set of migration guides. Going through these guides can take some time, but it enables you to upgrade your code easily to be PHP5 compatible.

To make your life even easier, the PHP-tools project is extended with the tool test-migration. This tool performs some checks that are described by the migration guide to detect whether the code can be run under version 5 of PHP. These checks include:

Is there a function definition with the same name as a function newly defined in PHP5?
Is an object created without the class being defined first?
Where are the functions strrpos, strripos and ip2long used?
Is there any place in which there is reflection within PHP that uses changed behavior?

The first two checks are rather easy to understand, PHP5 will simply halt execution with an error when these issues are detected at runtime. Therefore, the warnings that are generated for these kind of patterns are shown with a 'serious'-level.
On the other hand, the last two checks do not find constructions that can halt execution. They detect places in which certain constructions are used. These constructions where already available within PHP version 4, but their behavior changed in version 5. To easily find these constructions they are flagged by a 'minor'-warning. More details about the changed behavior can be found here.

The first version of this tool is quit basic and performs only a few checks. If you would like to see more check included, don't hesitate to drop me an email or put them in the comments.

It is getting noticed

Within the last two weeks I noticed that PHP-Front and PHP-Sat are being discovered by people that are looking for PHP-specific solutions. It is not that we are flooded with request, but I am still happy with every question :)

The first question was send to the psat-dev-mailinglist and was about the Cyclomatic complexity of PHP code. I replied that it would not get into PHP-Sat because it is not a bug-pattern. However, it would be a nice tool for the PHP-Tools project. I made a similar tool for Java because of an assignment in the past, so it is probably just a matter of renaming the Strategies to use the PHP-Front api. Unfortunately, I didn't get an answer about how the report should look like. If you have any ideas please let me know in the comments, or in the issue.

A second question was about the grammar of PHP-Front, or actually the license of this grammar. The people behind TXL have derived a PHP-grammar for TXL from the SDF-grammar in PHP-Front. Since our license does not state anything about derived work without common source, we were asked for our permission to distribute this new grammar. Naturally, this permission was given very quickly and we were also allowed to take a peak at the source. I must say that I find it interesting, but I currently do not have time to look into it all. I imagine that the definitions of the grammars and TXL itself is similar to how things work in Stratego, but I have to look into that.

The last question in this series is about defined functions. Finding out which functions are defined in a project is easy when classes are ignored, a simple grep on the project will do. When a project also includes classes it becomes trickier to get all functions defined outside of a class. The question was whether PHP-Front could help with this issue, and the answer is of course yes!
Within the reflection part of the library the list of defined functions and classes is already available. This makes it possible to write a tool to show all defined functions in just a few lines of code. Since it was also a nice tool for the PHP-Tools project I added a tool for this last Friday.

Another issue that was brought up by the last e-mail is the issue of our implementation language. Since Stratego is relatively unknown the project has a steep learning curve. On the other hand, if I had chosen a different implementation language it would have taken me way longer to implement the current features. And besides, this piece of code is not that hard to understand right?

  defined-functions-main =
    include-files-complex
  ; get-php-environment
  ; get-functions
  ; if ?[]
    then !"No functions defined."
    else  map(transform-to-message)
        ; lines                    
    end

Where you at?

Now that the propagation of safety-types seems to go smoothly it was time to dive into another subject: accessing the location of terms. In this case, the location of a term is defined as the location of the text in the original file that was parsed to that term. Since the strategy to annotate the AST with position-info is available in the standard libraries nowadays, it should be easy to access these locations and finally solve PSAT-91 right? Lets find out!

The first thing I did was to add a separate module to handle the low-level stuff of getting the location annotations. This module contains several getter-strategies that can retrieve, for example, the number of the start-line. The location info is captured in six different numbers: start-line, end-line, start-column, end-column, offset and length. A getter-strategy is available for all of them. Furthermore, the name of the file in which the term is defined can be retrieved.

Although these getter-strategies are useful, they are not meant to be called directly. I figured that the most common use of these functions would be reporting the values in some kind of (formatted) message. In order to capture this kind of behavior the strategy format-location-string(|message) is defined. This strategy takes a message with holes in the form of [STARTLINE] as parameter and fills these holes with values from the current term. A rather useful strategy if I say so myself.

To practice with this new piece of functionality I have added an extra option to the tool input-vector of the php-tools-project. This option allows the user to choose between the normal list, or the same list with line-numbers printed for each access. More information about this option and how to add an option yourself can be found here.

After this was done I moved to php-sat to make the output more concise. It was actually pretty easy to implement. The algorithm is nothing more then get-terms-with-annotations, make-nice-output. I actually spend more time on creating a test-setup for calling php-sat through a shell-script then on generating the more concise format. The only problem was that the adding of position-info everywhere interfered with the dynamic-rules. A few well-places rm-annotations where needed to fix this. Please let me know if you like the new output, or whether something should be added.

The next applications of the location info is the tracking of where untainted data enters an application. When a function is called with a parameter $foo which is tainted, it would be nice to show when it was tainted. I think this is not too difficult to add, but bugs always seem to lurk in 'I-think-it-is-easy-to-add'-features.

A last remark about locations is a small problem without an actual solution. Eventually php-sat must support function-calls. The algorithm to analyze function-calls is not complicated, but how can bug-patterns within a function be reported? A message before each call to this function? Within the file in which the function is defined? And what about cases in which one call is flagged and the other one isn't? And can we also handle object-creation in the same way? I haven't figured out how to handle this, so if you have any ideas please let me know.

Merging fun

Yesterday, commit number 379 and 380 introduced a renewed implementation of the constant-propagation, and new functionality for finding vulnerabilities. You guessed it, the merging of constant-propagation and vulnerability analysis has taken off! The cool things are that 1) the old tests all pass, 2) some new tests pass and 3) I get to write a lot more tests!

The main reason for starting the integration of both analysis was the fact that I saw a lot of code duplication popping up. This duplication was caused by the fact that the bookkeeping of internal structures is the same for all strategies, the code only differs for the properties of values.

With this last piece of information I started to merging. My first approach consisted of trying to pass a list of get-set strategies to the main-strategy and calling these strategies dynamically. This was obviously not a very good or nice start because it always seem to result in numerous segmentation faults.

Thinking about the problem made the second attempt somewhat more pragmatic. Instead of generalizing I just wanted to make it work for constant propagation in combination with a second analysis. So I rewrote the strategy to receive two strategies, one for getting the properties of literals and one for getting the properties of operators. The choice for these language constructs is based on the reasoning that these constructs are the only ones that make or manipulate the actual properties of values. The other language constructs manipulate the flow of the values instead of the actual values.

When this all seemed to work the more difficult challenge had to be solved, manipulation of variables and arrays. It turned out to be more simplistic then I thought because of the indirection between the variables and their values. For the purpose of dealing with aliasing, a variable does not point to a value but rather to a value-identifier. This identifier points to the actual value. This makes the creation of a reference easier, we just create a mapping from a variable to the value-identifier of the referenced variable. Because of this indirection we can simply make the value-identifier point to more then one property, implemented by a dynamic-rule for each property. This makes the merging of sets of dynamic rules a bit harder, but not impossible.

It might sound simple (or incomprehensible), but getting everything right was still a bit tricky. For example, when an assignment is made the property of the RHS must be known before the LHS can be assigned this property. So what happens when the constant propagation cannot compute a value, should we simply fail to assign a property?
The answer is no, the second analysis might still be successful. These kind of little problems made the implementation a little less straight-forward, and the code a little less beautiful.

However, the result of it all is that the following example is now flagged correctly:

<?php
 $foo = $_GET['asdf'];
 $bar = 1;
 $bar =& $foo;
 echo $bar;
?>

In this case, echo $bar will be flagged by the latest php-sat.

I experienced one problem with the implementation that is related to the semantics of Stratego. My first attempt in adding an annotation to a term was something like this:

 add-php-simple-value(|val):
    t{a*} -> t{annos}
      where  b*    :=  a*
           ; annos := [PHPSimpleValue(val) | b*]

This works perfectly, the annotations are matched as a list by the *-syntax, and a list is added as an annotation to the term again. The only problem with this is that the second time this rule is applied it matches the annotations as a list of a list of annotations, which was not the behavior I desired. This problem is easily solved by also adding a * to build the term:

 add-php-simple-value(|val):
    t{a*} -> t{annos*}
      where  b*     :=  a*
           ; annos* := [PHPSimpleValue(val) | b*]

Now the list of annotations is not wrapped in an actual list anymore. I know it is documented somewhere, but this little explanation might save some others from an headache or a long debug-session.

The next step in the analysis for vulnerabilities is a rather important one: testing. Even though the basic parts of variables and assignments are already tested, there exists a large number of scenarios that need to be tested on this new integration-strategy.
But hey, testing is fun!

The operator type

My first intention was to name this entry after one of these quotes about assumptions, because in the past week I found out (again) that they are totally correct.

Last Sunday I started to work on typing the operators of PHP. My first attempt was really basic, simply follow the documentation and everything will be alright. So I started with the arithmetic operators '+','-' and '*'. They all act the same on the type level in the sense that the result is always integer, unless there is a float involved. The next operator in line was division ('/') which always returns an float (according to the documentation). I finished off with the modulus operator which behavior turns out to be ...... not documented.

Since I want PHP-Sat to follow the semantics of PHP I needed to know the exact semantic of this operator in PHP. Naturally this is not hard to figure out because we simply run PHP on some test-data and use the function gettype. I wrote a little script with some lines like:

...
echo "Type of 5/2 = ", gettype(5/2), "<br />";
...

which is not only tedious, but also a complete violation of the DRY principle.

In order to comply with DRY I turned to the eval function of PHP. This function evaluates a string as PHP code and can actually return the result of a its computation! I used this function within a loop to dynamically calculate the result of applying an operator to different types. You can find the complete script here, but I will show the critical part here:

foreach($types2 as $key2 => $val2){
    $code   = 'return ' . $val1 .' '.$op.' '.$val2. ';' ;
    $result = eval($code);

    echo ''. gettype($result) . ' (' . showval($result)  . ') ';     
}

The foreach-loop resides within another foreach-loop looping through a copy of the array $types2. This gives us access to two different types and the current operator. These are combined within a string which is executed. We then extract the type and the result of the execution and display it.

Basically you just simply pass an array with types and a operator to the function and it will output the type and result of all possible combinations. I put the output of this script here, you can find the types and values that I used before the actual result. The first few sections contain most of the unary operators, the binary operators are organized in tables. Within the tables each row contains the first value given to the operator, each column is the second value. I apologize for for the ugly formatting, just trying to be a good programmer.

It was now a trivial task to read type(s) of the modulus operator from the table and be amazed. The result type is almost always an integer, but it can an also be a boolean type! This comes from the fact that whenever the RHS is a '0'-value the modulus cannot be calculated. Where other languages would simply halt execution, PHP returns a 'false'-value and issues a warning. This is also the case for division and is now also handled correctly in PHP-Sat.

You might wonder why I have chosen to test all of the operators instead of just the modulus. The first reason for this is that I could simply not resist the temptation, it was just to easy to add all the operators. The second reason was that the first run of the script also involved the testing of the division operator. According to the documentation it always always return a float, but the table tells us that it return either a float, a boolean or an integer! I might be able to see that the boolean return-value is not documented, it is almost certainly a mistake to divide by zero, but the documentation specifically states that division never returns an integer! Luckily (yes I believe it is a good thing), this problem is already mentioned by someone else and I hope it will be fixed before the year ends.

Back to PHP-Sat, the todo-list regarding operators has been heavily shortened. The only ones left are the assignment- and string-operators. The first group involve some dynamic rules which I want to implement when I am in a more awakened state-of-mind. For the second group I still have to figure out which safety-type is to be assigned when the safety-type of both side differ. I currently believe it should be the lowest safety-type, but please feel free to provide me with a counter-example that shows that this belief is wrong.

Taking precedence

It has finally been done, PHP-Front has perfect operator precedence!

And why can this be written down as a fact? Because of the work of Martin! He has worked on the grammar-engineering-tools tool for generating SDF priorities. When I worked on them I ran into some problems because of the decisions that where made during the initial development (read: when the tools had to be finished yesterday). This resulted in a common format for PHP precedence rules that was neither YACC-like nor SDF-like. Martin has rewritten the tools to use most of SDF representation and this worked very good according to all the solved issues. The operator precedence is now encoded into PHP-Front as separate modules for both version 4 and version 5, both of them over 6k LOC.

The (technical) details about how these files are generated will probably make a long and interesting blog-post. Since Martin says it is hard to find such a subject, not to mention (again) that it is his work, who am I to take this subject away from him? Maybe he can find some time after preparing the LDTA-presentation for Thursday. I won't be able to make it, but if you are close to Delft you might want to sneak into the Research Colloqium. (If you do, please tape it!)

Besides bringing perfect precedence to PHP-Front, this work has also resulted in a new issue. It was not very hard to figure it out, I just misinterpreted the note in the documentation. So I immediately fixed it by removing a special case which did not have to exist, just to put the icing on the cake.

Including only once

This week I finally got around to the implementation of a strategy that combines constant-prorogation and some sort of type-state. It was a bit tricky to get it completely right, resulting in some very weird behavior during testing. Fortunately, the problem has been solved and the result is pretty good!

So what problem does this strategy solve? In order to explain this we will first take a quick look at how constant propagation is being handled within PHP-Front (and other Stratego-libraries).

When a value of a variable is known, for example when it is defined in an expression like $foo = 5, PHP-Front will add a dynamic rule rewriting the variable to the known value. This dynamic rule can later be used to retrieve the value, for example in the expression $bar = $foo+1. This last expression will then be rewritten to $bar = 5+1 which can be completely calculated again.

Using dynamic rules for linear code is easy, but when code branches it becomes a little bit more complicated. Take this example code:

   $foo = 'say';
  if($val){ 
      $bar = 'hello';
  } else {
      $bar = 'world'; 
  }

The value of $bar can not be statically determined, but we want to keep the knowledge about the variable $foo. A valid way to do this is to take the intersection of the two sets of dynamic rules, the normal one and the one after evaluating the if-branch. This will keep all information about variables that are not used, or given the same value, within the branches of the if.
If you want to know more about dynamic rules and constant propagation I suggest you take a look at this paper.

Besides the information about the constant propagation we need to take a look at the inclusion mechanism of PHP. You might be aware of the fact that PHP offers both an include as well as an include_once construct. The first one will always include a file, the latter one will only include a file when it wasn't included before.

Dealing with files that are always included in combination with constant propagation is rather easy. We just retrieve/parse/evaluate the file during the constant propagation. Dealing with files that are only included once is also easy with linear code, but things get tricky when branching is involved. Take this piece of code for example:


  if($val){
    include_once 'foo.php';
  } else { 
    //other things
  }
   //some code
    include_once 'foo.php';

Within the if-branch the file foo.php might be included and introduce some dynamic-rules. The code for the if will handle this probably so that is not a problem.

But what must we do when we arrive at the other include_once-construct for the same file? Well, there are three different scenario's that can occur:

File has not been included
File might have been included
File is definitely included

The first and last one are easy, just include the file or not, the second one is more complicated with respect to dynamic rules. We will again have to take the intersection of the current set of dynamic rules and the set of dynamic rules after inclusion of the file. This will make sure that we only keep the rules that will be the same in both cases.
Notice that we will have to keep track of the inclusion-state of every file. This can (and is) implemented as a state for a file, again with dynamic rules.

Implementing both of these solutions with dynamic rules separately is not that difficult, there are enough sources to figure things out. However, combining these approaches is completely trivial because you will have to deal with dynamic rule-sets and all sorts of merge-strategies. Fortunately for you this is all done behind the scenes. Please enjoy the better constant propagation as of revision 332.

PHP-Sat.org finished

It took me a ((very) long) time, but I finally finished all the texts for PHP-Sat.org. It is always a challenge to think of the right subjects and placement of texts for a website, but I believe that everything is in the right place now.

The website contains all the basic things a project need: general information, source repository, mailing-lists and bug-tracker. The current information is quit basic, but it will grow over time.

But the website also contains some extensive documentation on the following topics:

This is also the reason it took some time to finish the site, but I think that it was useful. It gave me a a change to think about several aspects of the project.

Please let me know if you find any (spelling)-mistakes in the texts, still practicing my English over here.

PHP Operator precedence

If you take a look in the PHP-manual regarding operator precedence you will find the following quote:

Although ! has a higher precedence than =, PHP will still allow expressions similar to the following: if (!$a = foo()), in which case the return value of foo() is put into $a.

This is not very informative, which expressions are similar to this expression?

The answer to this can be found in this (huge!) file. This file is made with a tool called 'generate-prec-rules', which was created this week. It does exactly what the name implies, it generates the precedence rules given a set of common-precedence rules. The file above is created with the precedence rules from the yacc-definition of PHP version 5.

But how can we interpret these rules? Let's take a look at an example rule:


context-free priorities
  "@" Expr -> Expr 
  <1> >
  Expr "/" Expr -> Expr

This rule disallows a "/" on the second position of the '@'. Please note that this notation also counts the string-literal.

However, the expression @ 4 / 5 is not invalid, it can still be parsed and evaluated. The rule above just explains how PHP places the parenthesizes in the expression, in this case like this:
(@ 4 ) / 5.
Take a minute to think about the expression and how it is influenced by the above rule, it took me a while to figure it out.

It is nice to have all of the rules, they describe the operator precedence very precisely, but the size of the file is a bit problematic. Every operator, there are about 32, gets a rule for each other operator for each expression it has. The '*' alone already occurs in 82 rules.

So the next step that we will have to take is to combine rules without changing the semantics of the precedence relations. An example of this is the combination of these two rules:


context-free priorities
  Expr "*" Expr -> Expr 
  <2> >
  Expr "+" Expr -> Expr  

context-free priorities
   Expr "*" Expr  -> Expr 
   <0> >
   Expr "+" Expr  -> Expr

into this rule:


     Expr "*" Expr -> Expr 
 >  Expr "+" Expr  -> Expr

Which means the same, but is a bit more compact and probably more understandable.

So when this tool is created, and some things are changed within SDF, we can incorporate the newly generated precedence rules into PHP-Front. This will solve some parsing problems and will give us an even better SDF-grammar for PHP. We will also try to generate a more understandable, one page list of the operator precedence within PHP to replace the quote in the current PHP-documentation.

A nice example of 'research meets real life' don't you think?

String representations

With revision 299 we (again) have a list representation for DoubleQuoted- and HereDoc-strings. So the string "hello \t world" is represented by:


  ConstantEncapsedString(
         DoubleQuoted([Literal("hello ")
                                           ,Escape(116)
                                           ,Literal(" world")]
      ))

This not completely new because the first implementation already had this, but this representation had some problems. There was no way to model a hexadecimal escape without making things ambiguous. This problem can be solved by making the order of the internal parts explicit, but we then had a terrible representation of the string:


  ConstantEncapsedString(
       DoubleQuoted(
              DQContent(Some("hello ")
                                       ,Escape(116)
                                           ,Some(" world"))
     ))

Note that this string is represented with 1 of those DQContent-thingies, the biggest one has three children. So every string with more then 3 parts has nested DQContents. Let me give you an example, the string "Hello \\\0123" looks like:


  ConstantEncapsedString(
          DoubleQuoted(
            DQContent(Some("Hello ")
                                      ,DQContent(Escape(92)
                                                                   ,None
                                                                   ,OctaChar(48,49,50))
                                      ,Some("3"))
       ))

Terribly right?

So this had to be solved by a post-processing step. We have to walk over the tree bottom-up to rewrite these nasty DQContent's to a nice list. It is not nice to have such a post-process-step, but this was already required because of HereDoc-strings.

The problem with the HereDoc is analogous with the problem of the Dangling-else. If you have multiple HereDoc-strings with the same label you will have to choice where the first HereDoc ends. PHP always takes the shortest HereDoc so this piece of code has two variables:


<?php
    $foo = <<<BAR
       foo...
  
BAR;
  
    $bar = <<<BAR
       bar
  
BAR;
?>

As long as HereDoc is ambiguous this is easily solved by choosing the right amb-node.

But after the rewrite to the new internal implementation, HereDoc became unambiguous! This is a bit frustrating because it takes the longest HereDoc, which is wrong. I spend some time in trying to get it right, but I could not make it work (yet). So a new puzzle has entered the project, happy christmas! :)

P.S. As said in the last blog, I can hope, but maybe I should just read.

What's the deal with ... ?

What's the deal with the lack of updating of PHP-Sat?
The last few weeks were filled with all sorts of other interesting activities, at least for me. My days are mostly filled with doing research for my master thesis and writing my proposal. In the past two weeks my free time was filled with the SUD, the visit to Google and the Grammar Engineering Tools. So plenty of projects, but little time to work on PHP-Sat.

What's the deal with this thesis then?
The thesis is the final project/course I will have to finish before I graduate. I will do some research to validate a certain algorithm that will improve feedback in educational programs. A more detailed explanation of the subject/algorithm/approach will be put into a blog soon. It's a totally different subject, but still interesting for people in computer science as in education.

What's the deal with this Grammar Engineering Tools project?
As mentioned before, the paper that was the result of the project was already submitted. Some improvements are made and the tools are (going to be) updated. My work on this project was still in the interest of PHP-Sat though. We have gotten a great insight in which combinations of operators are valid and which are not. This information will be put on the web as soon as we have a reasonable format for it, so stay tuned!

What's the deal with those labels/link underneath the posts?
The new version of blogger allows you to have labels under a blog. I thought is would be nice to order the posts according to certain topics. People that only want to read about my thesis can look here, those that only want to read about PHP-Sat here. I hope that the people behind blogger will add a RSS-feed per label in the future, but I can only hope :)

A Doh-situation

I had a real 'Doh'-situation yesterday, it happened when I was trying out php-sat on several PHP-projects. I began with trying php-sat on the sources of Phorum, phpBB, phpMyAdmin and Joomla.

The results for these projects are very nice, considering the current status of the tool. I was surprised to see that the pattern C003, which is very simple and even documented in the PHP-documentation, was actually flagged in both phpMyAdmin and Joomla. The functions to blame can be found in Table.class.php (generateAlter and generateFieldSpec) from the phpMyAdmin-source, and database.php (database) from the Joomla-source.

A pattern that was found quiet often in phpMyAdmin is pattern O001.
I have done some simple tests and it turns out that passing multiple parameters is roughly twice as fast as using concatenation. This blog-entry confirms this by explaining what happens behind the scenes in both cases.
So the pattern describes a possible optimization that is not very useful if this patterns occurs only once. But if it happens in 32 cases in a single source file that is called common.lib.php, which is probably used commonly in the project, then it might be useful to take a look at it.

But back to the 'Doh'-situation. After trying php-sat in the projects I wanted to run it on a large php-codebase. A great resource of PHP-code can be found in the PEAR repository. The PHP-code for PEAR itself is about 30000 LOC so I gave this to php-sat first. I was not surprised that it took a while because the code is pretty large. But I got a little suspicious after about 10 minutes of just seeing files being parsed. So I put some more log-messages in the php-front-library and tried again. It turned out that php-sat was including 3 files in a cycle, but the includes happened because of a 'require_once'. I was pretty sure that I had fixed this problem by checking whether a file was previously included, and this was indeed the case. The only problem was that I put the files that where included in the environment after going into a file. The solution is pretty obvious, but such situations just call for a slap on your own forehead.

Propagation

So this week was still filled with constant propagation techniques, but also with software metrics. I have given a presentation about functional software metrics which was based on this PhD thesis. It is very hard to find papers about software metrics for functional languages, but it is even harder to find them for scripting languages! Please let me know if you know a website that covers metrics specifically for scripting languages, I don't expect there to be any papers about this subject.

I am looking for software metrics for scripting languages because I think that metrics can help in finding the weak spots in a program, but there has to be evidence to support this. Validation of software metrics is rather time-consuming, so hopefully done by somebody else :) It shouldn't be hard to make a tool based on php-front to generate a rapport with software metrics about your program, any volunteers?

But back to the constant propagation. I have added support for the propagation within loop-constructs, if-statements and switch-statements. This was not very hard because I could use the techniques described in this paper. The construct that was not discusses in this paper, or on any slide of the pt course (which is not given anymore, unfortunately), is the if-elseif-else construct. It is not very hard to come up with the (hopefully correct) semantics for this, but I could be wrong of course. I will explain my interpretation in the next section, but it might be hard to follow. Feel free to ask for a better explanation!

The if-elseif-else construct is evaluated as a list of list of statements. The first statement is used as an unit-element in a foldr over this list. This foldr applies a given strategy to every element, which is to be expected. But it also performs an split-and-intersect of the dynamic-rule-set between every two elements. The result of this is that the rule-set will only contain the dynamic rules that are stable in all branches of the if-elseif-else statement. The implementation of the switch-construct was pretty easy after I made the infra-structure for this "split-and-intersect"-foldr.

So the constant propagation is coming along very fine and is also added to the php-sat-tool. The new flag --complex-inclusion triggers the tool to perform constant-propagation and include the files that can be included in that way. The constant propagation is not completely finished, to consider this an 'alpha-feature in this 'beta'-tool.

Not the only one

It is always nice to know that you belong to a group. I discovered this week that I am part of a large group of people that still works on their SoC project. It turns out that many people still try to work on their project, but at a slower pace. I am one of those people :) I am also one of the people that can walk around in a Summer of Code 2006 t-shirt (front,back), it arrived this week. But enough about bragging, let's talk about PHP-Sat again..

You will probably know already that Google has released Code Search, a search engine specifically for code. It turns out that it can be used to find vulnerabilities in PHP sources. The regexp for XSS due to echoing raw input, which is heavenly quoted on the Internet, gives around 7000 results. There are off course false positives in these results, but many of the results would be 'useful'. Could there be a better way to show that PHP-Sat is needed?

So I have been working on PHP-Sat and, more specifically, on the completion of the constant-propagation. I have rearranged the tests for the simple-evaluation. There used to be one big testsuite with over 450 tests on almost 3000 LOC. There are now separate testsuites for expressions, operators, primitives and so on.

Functionality is added in the sense that constants can now be defined and actually used. I have also begun with the implementation of references. The manual says that references are just pointers to the same memory location. I have mimicked this behavior by adding an extra step between a variable and its value. Every variable is rewritten to a 'variable-identifier', this 'variable-identifier' is rewritten to the actual value. Introducing a reference is now a simply introducing a rewrite from a variable to a 'variable-identifier'.

Oh, and I have also updated the implementation of pattern C002 to handle static function calls. Mmmm ... I should really put more time into those patterns.

New and Improved inclusion

This week was not only filled with going to the university and playing around with PHP-SAT, it was also filled with (a lot of) preparation for the Flitsjacht, a game for scouts in our region. Organizing a game that is played by over 150 people in 11 teams in an area of about 10 square kilometers takes a lot of time, but the photo's are priceless.

Organizing this game does not mean that there is no progress on PHP-SAT, revision 256 holds a few new things. It includes some new tests, a few extra lines of source code and a lot of new functionality! The basis for the '--complex-inclusion'-flag is finished with the evaluation of the internal functions 'include','include_once', 'require' and 'require_once'.

The implementation of this evaluation was not that easy because of a small detail in the strategy 'find-in-path'. This strategy takes a list of paths and tries to find a file in those paths with a given file name. It does this by checking the existence of each 'path + file name'. This would always return the full 'path + file name', the next paragraph explains why this is useful, and worked like a charm. Until I tried it with a file name that was available at the current-directory! I did not get back the absolute path, but the original given file name. It took me a while to figure out that the strategy first checks for a file with only the given file name without taking the paths into account. I solved this by just looping over the list with directories and now everything is fine.

But why do we need the absolute path to include files? This makes it a lot easier to implement the logic of the 'include_once' and 'require_once' functions. Before we include a file we check if the file was already included by using the absolute path. The absolute path is necessary to prevent false positives in this case. False negatives are prevented by normalizing every path after a file is found. This normalization removes all '.' and '..' steps in a path and returns the direct and absolute path.

Next step: evaluate some more internal functions, especially the 'define' function that defines constants.

I love my test suite

When I started to implement the evaluation of control-statements for the constant propagation I realized that I made a wrong decision. This occurred to me when I implemented the control-flow with a single if statement. One of the side-effects of such a statement is that it should undefine variables that are changed within the body when the condition can not be evaluated.
For example:

<?php
 $foo = 'bar';
 if($bar == 1){
   $foo = 'foo';
 }
//Value of $foo is unknown
?>

Because we do not know the value of $bar we do not know the value of $foo after the condition. So the dynamic rule that maps $foo to the string-value "bar" should be undefined. Stratego has special syntax for these kind of situations and this makes the implementation very easy.

When I wanted to do this for arrays I realized that arrays where implemented as a key-value mapping in a hashtable. This is correct, useful and it works, but undefining mappings takes a lot of work because the special notation would not work with this implementation. So I had to change this implementation and use dynamic rules.

My test suite really helped me in the refactoring of this implementation. It showed which part where updated correctly and I could use the tests to analyze why the rest still failed. It also made me aware of the fact the when a dynamic rule is undefined the key still exists. The application to this undefined key will always fail, as expected, but the key still remains in the set of all keys of a dynamic rule. The test suite also identified some hidden assumptions in the old implementation, which where immediately removed.

All this information would not have been available to me without the test suite. To quote a large company: "I'm loving it!"

Anything new?

Sorry for the late update, college is really getting started so I have to get used to going to lectures and meetings again. But my php-*-projects are still going strong. The evaluation strategy can handle construction of, assigning to and reading entries from arrays now. An array in the interpreter is modeled as a map which maps keys to values. Some useful strategies have been implemented to easily add and retrieve values from the arrays. The documentation about arrays tells us that PHP also sees an array as a map, so the implementation of the semantics was not that hard. The only thing that it does not support (yet) is references. These will be added later.

The second thing that has been added is the fix for the long lasting psat-2 bug-entry. This states that HereDoc should be parsed as a sequence of literals and escapes, just like double-quoted strings. This turns out to be very tricky. Double-quoted strings have a very straight-forward start and end-point, HereDoc does not have this. The end of a HereDoc is marked with a newline-label-newline sequence. So the newline was added as an escape in a literal, we have to treat this newline in a special way if it is followed by a label. The problem that arises here is that we cannot express this within SDF. This should be written down as a follow-restriction on the newline, but then we would have to know the length of the label. This length is not fixed, so we can not do this. By not setting the follow restriction a literal became ambiguous when there was a variable in front of it. The solution here was to write out the definition of 'a list of literals or escapes'. This definition was extended by 'where there are no two literals in a row'. This solved all the problems for that moment and everybody was happy!
This happiness lasted until I tried to parse a file with a statement after the HereDoc, this failed. So a statement after a HereDoc caused the parser to keep parsing until the end of the file, something that we definitely do not want. So the problem here was that the end of a HereDoc was not strict enough. I tried several things, but nothing seemed to work. The optional semicolon was causing big problems. It turns out that this optional semicolon was a misinterpretation of the documentation. The documentation mentions this semicolon and I thought that it was part of the end of the HereDoc. But this semicolon is only allowed because the HereDoc is probably part of a bigger expression that needs to be closed by the semicolon. So I simplified the HereDoc-end and this allowed a nice follow-restriction. All the problems are over and everybody is happy again :)
So there is now a more expressive support for HereDoc, with a restriction. This was already true with the first implementation, but never really mentioned. The problem is that one can not express the fact that the labels of the HereDoc should match within SDF. So HereDoc is parsed from the first open label until the last closing label, even if there are statements in between. This restriction is unfortunate, but reality for all parsers that are based on SDF. It is simply not possible to solve this within the syntax definition. It can be solved by adding a post-processing step, so an issue for this is already created.

The finish this entry with some good news, there are windows-builds for both php-front and php-sat! It took us some time and some hackery implementation of stratego-routines to get everything right, but we succeeded. The hackery stuff will be moved to the stratego libraries as soon as possible. I have tested the build on my own machine and it seems to work fine. Please try it out if you have a windows machine that you can (ab)use. Windows-builds for stratego programs are relatively new, so there could be some hidden problems.

Finding differences

Here is another example that shows that testing is useful. I had implemented the type-juggling from Integer- to String-value and written some tests for this. All the tests succeeded on my machine so I could commit. I was surprised when I got a mail about a failing build. If you look at the page then you will notice that the check passed for i686-linux, but failed for i686-darwin. The tests that fail have in common that they handle a non-empty string that does not start with a number. Thus my code applied string-to-int to an empty string. This strategy is implemented in C and calls strtol. It turns out that the result of this function is not the same for both platforms if the input is an empty-string. I am not a C-expert, but i686-darwin seems to give an error and i686-linux does not do this. The problem is solved and tests are added to the Stratego-libraries for this. But it might be useful to know for others.

These implementation details might be interesting for some, others prefer an update. So what has been added to PHP-SAT? It is now possible to include files in a simple way. This means that all files are included without looking at the context. They can only be included if the file names are coded in Strings, so no concatenation. They should also be on the include-path. But this is the same for PHP.

It is also possible to print the included files. All files are printed to a file with the same file name and a post-fix '.psat'. The bug patterns within PHP-SAT are also applied to the included files, so these are checked automatically.

I am currently improving the simple evaluation to be able to handle more complex file inclusion. It should also improve file-inclusion by having the normal semantics for include_- and require_once.

The last cool thing is that almost all calls to the stratego-xtc-library are gone. This means that we could make windows-builds soon!