Skip to content

C++: New query for 'Cleartext transmission of sensitive information' #6713

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Oct 1, 2021

Conversation

geoffw0
Copy link
Contributor

@geoffw0 geoffw0 commented Sep 17, 2021

Add a new query 'Cleartext transmission of sensitive information' (cpp/cleartext-transmission). This fills a gap in our existing queries 'Cleartext storage of sensitive information in file' (cpp/cleartext-storage-file), 'Cleartext storage of sensitive information in buffer' (cpp/cleartext-storage-buffer) and 'Cleartext storage of sensitive information in an SQLite database' (cpp/cleartext-storage-database) - and allows us to catch the SAMATE Juliet tests cases for CWE-319.


Results on LGTM: https://lgtm.com/query/8348355846815574667/ --- decent number of results, without being noisy on any particular project. Most results look worthwhile, though it's hard to judge which if any are actual security vulnerabilities without digging deep into the design of the software in question. The alibaba/AliSQL is notable as a likely FP, as the path goes through a function encrypt_and_write_file, which presumably does what it says. The result in openwall/john also hints at encryption being present.

Results on SAMATE Juliet tests: https://jenkins.internal.semmle.com/job/Security/job/SAMATE/job/SAMATE-cpp-detailed/163/ --- we catch all of the CWE-319 tests, but unfortunately produce a large number of FPs on that same CWE.

Results on DCA: https://github.com/github/codeql-dca-main/issues/285 --- 448 results in the SAMATE Juliet test, 1 result in git. The latter looks like a reasonable result, and also shows up in the LGTM run (where it's easier to view).


I've given the query @precision medium for now, with the hopes of fixing most of the FPs in a follow-up and upgrading it to high.

@geoffw0 geoffw0 added the C++ label Sep 17, 2021
@geoffw0 geoffw0 requested a review from a team as a code owner September 17, 2021 14:05
Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! I'm pretty happy with the code as-is, but I'll give the rest of the team some time to review it as well.

*
* note: functions such as `write` may be writing to a network source or a file. We could attempt to determine which, and sort results into `cpp/cleartext-transmission` and perhaps `cpp/cleartext-storage-file`. In practice it usually isn't very important which query reports a result as long as its reported exactly once.
*/
class NetworkSend extends NetworkSendRecv {
Copy link
Contributor

@MathiasVP MathiasVP Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we replace this whole class with RemoteFlowSinkFunction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Ish...

First, this class is a call to a RemoteFlowSinkFunction/RemoteFlowSourceFunction, not a RemoteFlowSinkFunction/RemoteFlowSourceFunction itself, so I'll still have to wrap those - unless there's a cleaner way to wire it up. But this does avoid duplicating the list of functions, which is a good thing.

Second, RemoteFlowSourceFunction doesn't have an equivalent of getSocketExpr() for ruling out reads from standard input. We could have RemoteFlowSourceFunction expose this information or even do the work to rule out these sources itself?

Third, RemoteFlowSourceFunction also includes getEnv (I'm not sure why, I thought the whole point was to be able to specify sources a bit more precisely). So ideally we'd use something finer that doesn't include that (but Recv is currently private).

Copy link
Contributor

@MathiasVP MathiasVP Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second, RemoteFlowSourceFunction doesn't have an equivalent of getSocketExpr() for ruling out reads from standard input. We could have RemoteFlowSourceFunction expose this information or even do the work to rule out these sources itself?

This seems like a thing we would want to do in RemoteFlowSourceFunction, yeah.

Third, RemoteFlowSourceFunction also includes getEnv (I'm not sure why, I thought the whole point was to be able to specify sources a bit more precisely). So ideally we'd use something finer that doesn't include that (but Recv is currently private).

I think getEnv is only included in LocalFlowSourceFunction, right? That's at least what I see here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, this class is a call to a RemoteFlowSinkFunction/RemoteFlowSourceFunction, not a RemoteFlowSinkFunction/RemoteFlowSourceFunction itself, so I'll still have to wrap those - unless there's a cleaner way to wire it up. But this does avoid duplicating the list of functions, which is a good thing.

We have RemoteFlowSource wrapping RemoteFlowSourceFunction in semmle.code.cpp.models.interfaces.FlowSource, but no equivalent for sinks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we replace this whole class with RemoteFlowSinkFunction?
Could we replace this whole file with RemoteFlowSourceFunction?

I've updated the code to (more-or-less) do this.

The following are considered a RemoteFlowSourceFunction (# = not already modelled by NetworkRecv): recv, recvfrom, recvmsg, read, pread, readv, preadv, preadv2#, gets#, fgets#, fgetws#, getdelim#, getwdelim#, __getdelim#, fread#. Some of the new functions read from a socket, some from a FILE *, and I'm a bit concerned that we'll get overlaps with cpp/cleartext-storage-file from the latter. They're differently structured queries though so I'm not sure this will really be an issue in practice. Probably best to fix it when we see it.

The following are considered a RemoteFlowSinkFunction (all already modelled by NetworkSend): send, sendto, sendmsg, write, writev, pwritev, pwritev2. So no problems there.

We have RemoteFlowSource wrapping RemoteFlowSourceFunction in semmle.code.cpp.models.interfaces.FlowSource, but no equivalent for sinks.

This looks promising, but it's in the security.FlowSources lib which conflicts with dataflow.TaintTracking (and as you point out RemoteFlowSink would need to be added). This PR has enough in it that I won't do this change here, but if you feel strongly I can do it as a follow-up?

Second, RemoteFlowSourceFunction doesn't have an equivalent of getSocketExpr() for ruling out reads from standard input. We could have RemoteFlowSourceFunction expose this information or even do the work to rule out these sources itself?

This seems like a thing we would want to do in RemoteFlowSourceFunction, yeah.

It turns out RemoteFlowSourceFunction can't rule out stdin/stdout by itself, as it's modelling the function not the call and thus has no idea of parameter values. We could potentially do this work in an improved RemoteFlowSource class at some point. But for now I've added a hasSocketInput predicate to the two models so that users can tell which parameter is the socket (if any).

I think getEnv is only included in LocalFlowSourceFunction, right?

Correct, thanks.


TL;DR: I've updated the query to use the RemoteFlowSourceFunction and RemoteFlowSinkFunction models. I believe all points above have been addressed, apart from the possible follow-up PR improving RemoteFlowSource.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons we generally prefer modelling Function rather than Call is that a Function model also applies directly to IR dataflow, whereas a Call model has to be translated. Of course, a Call model can have more details. The hybrid you've arrived at here LGTM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better (at least, less code) to join the NetworkSend and NetworkReceive parts at the Function level, then look at calls further down. I'll have a go at a refactor...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make much difference to code quantity or clarity IMO. The awkward bit remains converting from a FunctionInput / FunctionOutput to an Expr or DataFlow::Node. Is there a reason we don't have predicates to do this in FunctionInputsAndOutputs.qll?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK there is no reason why we don't have this logic centralized in FunctionInputsAndOutputs.qll, no. I'll open an issue for it.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Sep 24, 2021

As discussed elsewhere, the plan is to merge this after the feature freeze (Monday).

Comment on lines +100 to +104
// a zero socket descriptor is standard input, which is not interesting for this query.
not exists(Zero zero |
DataFlow::localFlow(DataFlow::exprNode(zero),
DataFlow::exprNode(transmission.getSocketExpr()))
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the same problem present on the transmission side? Writing to stdout and stderr would be file descriptors 1 and 2, respectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes absolutely. Addressing this is on my list of things to do in the CWE-319 ticket, but I don't think there's an existing library class to replace Zero with that will just do it. There is some configurability in Nullness.qll (that defines Zero), but it might be better to use GVN or dataflow instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be fine to exclude any constant-valued socket expression. The ones we care about here are always created from calling socket or similar functions, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that's sensible. Not only does it avoid a list of cases, any constant we're unaware of is likely to be something special we shouldn't make assumptions about.

*
* note: functions such as `write` may be writing to a network source or a file. We could attempt to determine which, and sort results into `cpp/cleartext-transmission` and perhaps `cpp/cleartext-storage-file`. In practice it usually isn't very important which query reports a result as long as its reported exactly once.
*/
class NetworkSend extends NetworkSendRecv {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons we generally prefer modelling Function rather than Call is that a Function model also applies directly to IR dataflow, whereas a Call model has to be translated. Of course, a Call model can have more details. The hybrid you've arrived at here LGTM.

@jbj
Copy link
Contributor

jbj commented Sep 27, 2021

I really like the LGTM results for this query even though there are a few FPs. I'm guessing most projects don't realize they've vendored code that will do unencrypted auth to an HTTP server or a proxy server. Maybe that seemed fine 10 years ago.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Oct 1, 2021

Fixed autoformat. There are a couple of things that came up in discussions to be addressed as follow-up, but I believe this PR itself is ready to merge.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (assuming it passes CI)

@MathiasVP MathiasVP merged commit a3cf721 into github:main Oct 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants