Glob complexity is Quadratic on directory depth #106

Closed
jaraco opened this issue Jul 14, 2023 · 3 comments · Fixed by #113
Comments

@jaraco
Owner

jaraco commented Jul 14, 2023

In #105, this project reworked the glob functionality. During that effort, I found that in test_glob_depth the best-fit complexity was never better than Quadratic. That's why I wrote test_baseline_regex_complexity, to show that the regex match is Constant in the length of the path, which means globbing should be Linear in the number of paths.

It's probably not important, but I'd like to get a good answer for why the test performance isn't better than Quadratic.
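
The reasoning above can be sketched as follows. This is only an illustration using the standard `re` module: the pattern and file names are made up, and zipp's own `Translator` is not reproduced here. Globbing performs one `fullmatch` per candidate name, so if each match is cheap, the total work should grow with the number of names.

```python
import re

# Hypothetical stand-in for a translated glob pattern ("*/*.txt");
# zipp builds its regex with its own Translator, which is not used here.
pattern = re.compile(r"[^/]*/[^/]*\.txt")


def count_matches(names):
    """Make one pass over names; return (fullmatch calls, matching names)."""
    calls = 0
    hits = []
    for name in names:
        calls += 1
        if pattern.fullmatch(name):
            hits.append(name)
    return calls, hits


# Ten names one directory deep, plus one top-level name that should not match.
names = [f"dir{i}/file{i}.txt" for i in range(10)] + ["top.txt"]
calls, hits = count_matches(names)
```

The number of `fullmatch` calls equals the number of names, so any worse-than-linear growth has to come from somewhere other than the matching step itself.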

jaraco added a commit that referenced this issue Jul 14, 2023
jaraco changed the title from "Glob complexity is Quadratic" to "Glob complexity is Quadratic on directory depth" on Jul 14, 2023
@jaraco
Owner Author

jaraco commented Jul 14, 2023

@nh2 Perhaps you'd be interested to take a look and see if you can understand why the performance is quadratic.

@jaraco
Owner Author

jaraco commented Mar 13, 2024

The complexity appears to be coming from the call to the zipfile namelist. If I add this patch:

 zipp main @ git diff
diff --git a/tests/test_complexity.py b/tests/test_complexity.py
index 67e9c17..7b91505 100644
--- a/tests/test_complexity.py
+++ b/tests/test_complexity.py
@@ -39,7 +39,9 @@ class TestComplexity(unittest.TestCase):
         for path, name in pairs:
             zf.writestr(f"{path}{name}.txt", b'')
         zf.filename = "big un.zip"
-        return zipp.Path(zf)
+        res = zipp.Path(zf)
+        res._saved_namelist = res.root.namelist()
+        return res
 
     @classmethod
     def make_names(cls, width, letters=string.ascii_lowercase):
@@ -81,6 +83,7 @@ class TestComplexity(unittest.TestCase):
             max_n=100,
             min_n=1,
         )
+        breakpoint()
         assert best <= big_o.complexities.Quadratic
 
     @pytest.mark.flaky
diff --git a/zipp/__init__.py b/zipp/__init__.py
index a1b9884..e62dc05 100644
--- a/zipp/__init__.py
+++ b/zipp/__init__.py
@@ -399,7 +399,7 @@ class Path:
         prefix = re.escape(self.at)
         tr = Translator(seps='/')
         matches = re.compile(prefix + tr.translate(pattern)).fullmatch
-        return map(self._next, filter(matches, self.root.namelist()))
+        return map(self._next, filter(matches, self._saved_namelist))
 
     def rglob(self, pattern):
         return self.glob(f'**/{pattern}')

The result comes back as Constant time (in one test; it's probably Linear).

@jaraco
Owner Author

jaraco commented Mar 13, 2024

The problem is that ZipFile.namelist constructs a new list, which is apparently quadratic in the length of the filelist. Bypassing that list construction restores the expectation of linear or better performance.
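
A minimal sketch of the list construction, using only the standard library `zipfile` module rather than zipp itself. The helper `build_nested_zip` and the file names are hypothetical. It shows only that each `namelist()` call returns a new list, and that a caller can take the list once and reuse it, as the patch above does with `_saved_namelist`. It does not measure or reproduce the quadratic behavior.

```python
import io
import re
import zipfile


def build_nested_zip(depth):
    """Build an in-memory zip with one empty file at each directory level."""
    buf = io.BytesIO()
    zf = zipfile.ZipFile(buf, "w")
    for level in range(1, depth + 1):
        path = "/".join(["d"] * level)
        zf.writestr(f"{path}/file.txt", b"")
    return zf


zf = build_nested_zip(5)

# Two calls give equal contents but distinct list objects.
first = zf.namelist()
second = zf.namelist()
fresh_list_each_call = first == second and first is not second

# Take the names once and reuse them for filtering, instead of calling
# namelist() again for every glob.
saved_names = zf.namelist()
matches = re.compile(r".*/file\.txt").fullmatch
found = list(filter(matches, saved_names))
```

Reusing `saved_names` is safe only as long as the archive's contents do not change after the list is taken.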
