Skip to content

Include and path handling optimization #447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jul 7, 2025
Merged

Conversation

glankk
Copy link
Contributor

@glankk glankk commented Jun 24, 2025

This is a draft with some suggested changes to how includes are handled. There are several style issues at the moment, but I would like to discuss the overall approach before addressing those.

The changes to the include lookup procedure are based on the observation that files are very often included multiple times, but very rarely included with a different name. To take advantage of this, the lookup procedure first tries to do an optimistic match using the quick and simple path canonicalization provided by simplifyPath. In the rare case that a file is included with a name that hasn't been seen before (due to absoluteness change, capitalization, or link), then and only then is that file compared against the cached files using a more accurate but expensive procedure.

For the case where the file identity cannot be determined with simplifyPath, the file id is used (FILE_ID_INFO on windows, st_dev & st_ino on linux). This combined with the fact that file identity checking is delayed until absolutely necessary means that all calls to realFilename have been removed. This is where most of the timesave comes from.

Using file id means that the lookup procedure has less platform-specific code (only for getting and comparing the file id), and the behavior for case-sensitivty and links is uniform across platforms. The behavior for symlinks matches that of gcc and clang, which is to allows pragma once to work when a file is included with a different name. There's no standard definition for what pragma once should do, but personally I think this is the desired behavior. Intuitively, there are no situations where it is desirable for #pragma once to produce redefinition errors if a file is included multiple times using different names.

The code runs about 2 minutes faster for a large test case on windows with many nested includes (12m40s down from 14m40s), though it's a contrived test case so I'm not sure how representative that is of a real use case.

Some other things;

  • The new procedure requires some additional metadata. Instead of adding more static data in simplecpp, I've opted to place both the token lists and metadata in a new class for the sake of better structure. Other file-specific optimizations could also be facilitated by this such as the multiple-include optimization used by gcc, though using a class is not strictly necessary and does incur a change to the simplecpp interface.
  • Some c++11 features are used, but these could be removed or placed in ifdefs.
  • The rules for how simplecpp should search for headers are not always clear, ideally this would be stated precisely somewhere to make sure that the implementation is correct. Specifically, gcc and clang does not look in the current working directory unless -I. is explicitly passed, but simplecpp does. I would like to clarify whether or not this is intentional or if the behavior should match other compilers.

@danmar
Copy link
Owner

danmar commented Jun 24, 2025

Specifically, gcc and clang does not look in the current working directory unless -I. is explicitly passed, but simplecpp does. I would like to clarify whether or not this is intentional or if the behavior should match other compilers.

It sounds reasonable to match compilers. However gcc will look in current directory in this little test case:

daniel@laptop:~/abc$ cat file1.c
#include "file1.h"
daniel@laptop:~/abc$ cat file1.h

x=1;
daniel@laptop:~/abc$ gcc -E file1.c 
# 0 "file1.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "file1.c"
# 1 "file1.h" 1

x=1;
# 2 "file1.c" 2

@glankk
Copy link
Contributor Author

glankk commented Jun 24, 2025

It sounds reasonable to match compilers. However gcc will look in current directory in this little test case:

This test case works because file1.h is located in the same directory as file1.c. The following case does not work:

glank@penguin:~/abc$ ls -R
.:
file1.h  src

./src:
file1.c
glank@penguin:~/abc$ cat src/file1.c
#include "file1.h"
glank@penguin:~/abc$ cat file1.h
x=1;
glank@penguin:~/abc$ gcc -E src/file1.c
# 0 "src/file1.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "src/file1.c"
src/file1.c:1:10: fatal error: file1.h: No such file or directory
    1 | #include "file1.h"
      |          ^~~~~~~~~
compilation terminated.

However, this works:

glank@penguin:~/abc$ gcc -E -I. src/file1.c
# 0 "src/file1.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "src/file1.c"
# 1 "./file1.h" 1
x=1;
# 2 "src/file1.c" 2

@danmar
Copy link
Owner

danmar commented Jun 25, 2025

This test case works because file1.h is located in the same directory as file1.c. The following case does not work:

I don't remember the motivations. but basically it sounds reasonable to match gcc.

@danmar
Copy link
Owner

danmar commented Jun 25, 2025

I don't mind at all that realFilename is removed. So imho feel free to continue. I would like to see how the tests will be affected..

@glankk glankk marked this pull request as ready for review June 27, 2025 14:51
@glankk glankk force-pushed the optimize_paths branch 3 times, most recently from 06cd050 to 51842f7 Compare June 28, 2025 21:39
@danmar
Copy link
Owner

danmar commented Jul 3, 2025

I would like that you make a draft PR in cppcheck repo which updates simplecpp in cppcheck. It would be interesting to see if all the tests in Cppcheck will also be happy with these changes.

@glankk
Copy link
Contributor Author

glankk commented Jul 3, 2025

I've made some additional changes to the path handling, including a pretty big simplification of how path strings are computed. This means that many helper functions have been able to be removed, which I think makes the code cleaner and easier to follow, and also a bit faster.
Tests are passing in simplecpp and cppcheck. I've made a draft of the cppcheck integration here: danmar/cppcheck#7635

Copy link
Owner

@danmar danmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits.
I am pretty happy about this and feel that the code is a lot cleaner overall.

Copy link
Owner

@danmar danmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to get the has_include done first but then I approve this..

@danmar danmar merged commit a0430f3 into danmar:master Jul 7, 2025
16 checks passed
@glankk glankk deleted the optimize_paths branch July 7, 2025 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants