-
Notifications
You must be signed in to change notification settings - Fork 1.7k
C++: New query for 'Cleartext transmission of sensitive information' #6713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…Ps in SAMATE Juliet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good! I'm pretty happy with the code as-is, but I'll give the rest of the team some time to review it as well.
| * | ||
| * note: functions such as `write` may be writing to a network source or a file. We could attempt to determine which, and sort results into `cpp/cleartext-transmission` and perhaps `cpp/cleartext-storage-file`. In practice it usually isn't very important which query reports a result as long as its reported exactly once. | ||
| */ | ||
| class NetworkSend extends NetworkSendRecv { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace this whole class with RemoteFlowSinkFunction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Ish...
First, this class is a call to a RemoteFlowSinkFunction/RemoteFlowSourceFunction, not a RemoteFlowSinkFunction/RemoteFlowSourceFunction itself, so I'll still have to wrap those - unless there's a cleaner way to wire it up. But this does avoid duplicating the list of functions, which is a good thing.
Second, RemoteFlowSourceFunction doesn't have an equivalent of getSocketExpr() for ruling out reads from standard input. We could have RemoteFlowSourceFunction expose this information or even do the work to rule out these sources itself?
Third, RemoteFlowSourceFunction also includes getEnv (I'm not sure why, I thought the whole point was to be able to specify sources a bit more precisely). So ideally we'd use something finer that doesn't include that (but Recv is currently private).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second,
RemoteFlowSourceFunctiondoesn't have an equivalent ofgetSocketExpr()for ruling out reads from standard input. We could haveRemoteFlowSourceFunctionexpose this information or even do the work to rule out these sources itself?
This seems like a thing we would want to do in RemoteFlowSourceFunction, yeah.
Third,
RemoteFlowSourceFunctionalso includesgetEnv(I'm not sure why, I thought the whole point was to be able to specify sources a bit more precisely). So ideally we'd use something finer that doesn't include that (butRecvis currently private).
I think getEnv is only included in LocalFlowSourceFunction, right? That's at least what I see here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, this class is a call to a
RemoteFlowSinkFunction/RemoteFlowSourceFunction, not aRemoteFlowSinkFunction/RemoteFlowSourceFunctionitself, so I'll still have to wrap those - unless there's a cleaner way to wire it up. But this does avoid duplicating the list of functions, which is a good thing.
We have RemoteFlowSource wrapping RemoteFlowSourceFunction in semmle.code.cpp.models.interfaces.FlowSource, but no equivalent for sinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace this whole class with
RemoteFlowSinkFunction?
Could we replace this whole file with RemoteFlowSourceFunction?
I've updated the code to (more-or-less) do this.
The following are considered a RemoteFlowSourceFunction (# = not already modelled by NetworkRecv): recv, recvfrom, recvmsg, read, pread, readv, preadv, preadv2#, gets#, fgets#, fgetws#, getdelim#, getwdelim#, __getdelim#, fread#. Some of the new functions read from a socket, some from a FILE *, and I'm a bit concerned that we'll get overlaps with cpp/cleartext-storage-file from the latter. They're differently structured queries though so I'm not sure this will really be an issue in practice. Probably best to fix it when we see it.
The following are considered a RemoteFlowSinkFunction (all already modelled by NetworkSend): send, sendto, sendmsg, write, writev, pwritev, pwritev2. So no problems there.
We have
RemoteFlowSourcewrappingRemoteFlowSourceFunctioninsemmle.code.cpp.models.interfaces.FlowSource, but no equivalent for sinks.
This looks promising, but it's in the security.FlowSources lib which conflicts with dataflow.TaintTracking (and as you point out RemoteFlowSink would need to be added). This PR has enough in it that I won't do this change here, but if you feel strongly I can do it as a follow-up?
Second,
RemoteFlowSourceFunctiondoesn't have an equivalent ofgetSocketExpr()for ruling out reads from standard input. We could haveRemoteFlowSourceFunctionexpose this information or even do the work to rule out these sources itself?
This seems like a thing we would want to do in
RemoteFlowSourceFunction, yeah.
It turns out RemoteFlowSourceFunction can't rule out stdin/stdout by itself, as it's modelling the function not the call and thus has no idea of parameter values. We could potentially do this work in an improved RemoteFlowSource class at some point. But for now I've added a hasSocketInput predicate to the two models so that users can tell which parameter is the socket (if any).
I think
getEnvis only included inLocalFlowSourceFunction, right?
Correct, thanks.
TL;DR: I've updated the query to use the RemoteFlowSourceFunction and RemoteFlowSinkFunction models. I believe all points above have been addressed, apart from the possible follow-up PR improving RemoteFlowSource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the reasons we generally prefer modelling Function rather than Call is that a Function model also applies directly to IR dataflow, whereas a Call model has to be translated. Of course, a Call model can have more details. The hybrid you've arrived at here LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better (at least, less code) to join the NetworkSend and NetworkReceive parts at the Function level, then look at calls further down. I'll have a go at a refactor...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't make much difference to code quantity or clarity IMO. The awkward bit remains converting from a FunctionInput / FunctionOutput to an Expr or DataFlow::Node. Is there a reason we don't have predicates to do this in FunctionInputsAndOutputs.qll?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK there is no reason why we don't have this logic centralized in FunctionInputsAndOutputs.qll, no. I'll open an issue for it.
|
As discussed elsewhere, the plan is to merge this after the feature freeze (Monday). |
| // a zero socket descriptor is standard input, which is not interesting for this query. | ||
| not exists(Zero zero | | ||
| DataFlow::localFlow(DataFlow::exprNode(zero), | ||
| DataFlow::exprNode(transmission.getSocketExpr())) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the same problem present on the transmission side? Writing to stdout and stderr would be file descriptors 1 and 2, respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes absolutely. Addressing this is on my list of things to do in the CWE-319 ticket, but I don't think there's an existing library class to replace Zero with that will just do it. There is some configurability in Nullness.qll (that defines Zero), but it might be better to use GVN or dataflow instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be fine to exclude any constant-valued socket expression. The ones we care about here are always created from calling socket or similar functions, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that's sensible. Not only does it avoid a list of cases, any constant we're unaware of is likely to be something special we shouldn't make assumptions about.
| * | ||
| * note: functions such as `write` may be writing to a network source or a file. We could attempt to determine which, and sort results into `cpp/cleartext-transmission` and perhaps `cpp/cleartext-storage-file`. In practice it usually isn't very important which query reports a result as long as its reported exactly once. | ||
| */ | ||
| class NetworkSend extends NetworkSendRecv { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the reasons we generally prefer modelling Function rather than Call is that a Function model also applies directly to IR dataflow, whereas a Call model has to be translated. Of course, a Call model can have more details. The hybrid you've arrived at here LGTM.
|
I really like the LGTM results for this query even though there are a few FPs. I'm guessing most projects don't realize they've vendored code that will do unencrypted auth to an HTTP server or a proxy server. Maybe that seemed fine 10 years ago. |
Co-authored-by: Jonas Jensen <jbj@github.com>
|
Fixed autoformat. There are a couple of things that came up in discussions to be addressed as follow-up, but I believe this PR itself is ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (assuming it passes CI)
Add a new query 'Cleartext transmission of sensitive information' (
cpp/cleartext-transmission). This fills a gap in our existing queries 'Cleartext storage of sensitive information in file' (cpp/cleartext-storage-file), 'Cleartext storage of sensitive information in buffer' (cpp/cleartext-storage-buffer) and 'Cleartext storage of sensitive information in an SQLite database' (cpp/cleartext-storage-database) - and allows us to catch the SAMATE Juliet tests cases for CWE-319.Results on LGTM: https://lgtm.com/query/8348355846815574667/ --- decent number of results, without being noisy on any particular project. Most results look worthwhile, though it's hard to judge which if any are actual security vulnerabilities without digging deep into the design of the software in question. The
alibaba/AliSQLis notable as a likely FP, as the path goes through a functionencrypt_and_write_file, which presumably does what it says. The result inopenwall/johnalso hints at encryption being present.Results on SAMATE Juliet tests: https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp-detailed/163/ --- we catch all of the CWE-319 tests, but unfortunately produce a large number of FPs on that same CWE.
Results on DCA: https://github.com/github/codeql-dca-main/issues/285 --- 448 results in the SAMATE Juliet test, 1 result in
git. The latter looks like a reasonable result, and also shows up in the LGTM run (where it's easier to view).I've given the query
@precision mediumfor now, with the hopes of fixing most of the FPs in a follow-up and upgrading it tohigh.