
An Ubjson Parsing Problem Can Easily Cause DDoS Attack. #2793

Open · spidermana opened this issue Jun 1, 2021 · 14 comments

Labels: aspect: binary formats (BSON, CBOR, MessagePack, UBJSON) · state: please discuss · state: stale

Comments

@spidermana

What is the issue you have?

When parsing UBJSON and converting it to JSON with the from_ubjson function, an extremely long array of nulls (unlike other values, nulls do not need to be supplied by the user in the input; only their count is declared) can lead to a runaway loop, i.e. a loop-explosion problem.

In my initial experiments, taking just 10 characters as input can cause from_ubjson to execute for over 150 s (even with -O3 enabled) and use over 35 GB of memory on my server. Moreover, the effect of such an attack is cumulative: if this 10-character string is repeated n times, it generates 150n seconds of execution time and 35n GB of memory usage.

Any host that uses the affected function is vulnerable to an extremely serious DoS attack due to the unreasonably high CPU and memory usage.

The stack traces and error messages are as follows:

	……
    #12 0x5957e7 in emplace_back<nullptr_t> /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/vector.tcc:121:4
    #13 0x5957e7 in nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>* nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> >::handle_value<std::nullptr_t>(std::nullptr_t&&) /root/JSON_BUILD/single_include/nlohmann/json.hpp:2981:46
    #14 0x5ae444 in null /root/JSON_BUILD/single_include/nlohmann/json.hpp:2849:9
    #15 0x5ae444 in nlohmann::detail::binary_reader<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>, nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> > >::get_ubjson_value(int) /root/JSON_BUILD/single_include/nlohmann/json.hpp:5091:29
    #16 0x5b1eb9 in nlohmann::detail::binary_reader<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>, nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> > >::get_ubjson_array() /root/JSON_BUILD/single_include/nlohmann/json.hpp:5195:29
    #17 0x5ad8be in nlohmann::detail::binary_reader<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>, nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> > >::get_ubjson_value(int) /root/JSON_BUILD/single_include/nlohmann/json.hpp:5158:24
    #18 0x5b1faa in parse_ubjson_internal /root/JSON_BUILD/single_include/nlohmann/json.hpp:4889:16
    #19 0x5b1faa in nlohmann::detail::binary_reader<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>, nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> > >::get_ubjson_array() /root/JSON_BUILD/single_include/nlohmann/json.hpp:5222:21
    #20 0x5ad8be in nlohmann::detail::binary_reader<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>, nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> > >::get_ubjson_value(int) /root/JSON_BUILD/single_include/nlohmann/json.hpp:5158:24
    #21 0x597a80 in parse_ubjson_internal /root/JSON_BUILD/single_include/nlohmann/json.hpp:4889:16
    #22 0x597a80 in nlohmann::detail::binary_reader<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>, nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> > >::sax_parse(nlohmann::detail::input_format_t, nlohmann::detail::json_sax_dom_parser<nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> >*, bool) /root/JSON_BUILD/single_include/nlohmann/json.hpp:3599:26
    #23 0x5578ba in nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer>::from_ubjson(nlohmann::detail::input_adapter&&, bool, bool) /root/JSON_BUILD/single_include/nlohmann/json.hpp:19829:66

Input: 0x22,0x37,0x2c,0x23,0x69,0x69,0x6f,0x24,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x6f,0x2c,0x23,0x2d,0x69,0x69,0x69,0x3a,0x69,0x5b,0x5b,0x24,0x5a,0x23,0x69,0x3a,0x69,0x5b,0x5b,0x24,0x5a,0x23,0x69,0x5b,0x5b,0x24,0x5a,0x23,0x6c,0x3a,0x69,0x5b,0x5b,0x24,0x5a,0x23,0x69,0x69,0x22,
\"7,#iio$ooooooooooooo,#-iii:i[[$Z#i:i[[$Z#i[[$Z#l:i[[$Z#ii\"
Error: out-of-memory and TLE

Please describe the steps to reproduce the issue.

  1. Download the test case from here.
  2. Get the code to parse the UBJSON from here.
  3. Compile with: g++ -O3 -std=c++11 -g -I <JSON_SRC_PATH>/single_include ./test_ubjson.cpp
  4. Run a.out and check the CPU and memory usage.
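
In case the linked files become unavailable, here is a minimal sketch of an equivalent test program (a reconstruction, not the original test_ubjson.cpp), using the shorter "[$Z#l1~~~" payload discussed below:

    // WARNING: running this can consume tens of GB of RAM and minutes of CPU.
    #include <cstdint>
    #include <iostream>
    #include <vector>
    #include <nlohmann/json.hpp>

    int main()
    {
        // "[$Z#l1~~~": an array ('[') of nulls ('$Z') whose count ('#')
        // is the big-endian int32 ('l') 0x317E7E7E = 830,373,502,
        // declared in only 9 input bytes.
        const std::vector<std::uint8_t> payload = {
            '[', '$', 'Z', '#', 'l', 0x31, 0x7E, 0x7E, 0x7E
        };
        nlohmann::json j = nlohmann::json::from_ubjson(payload);
        std::cout << j.size() << '\n';
    }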

Can you provide a small but working code example?

Please see above.

What is the expected behavior?

It should parse the declared nulls very quickly for such a small input.

And what is the actual behavior instead?

The from_ubjson function exhibits extremely high memory and CPU usage when parsing UBJSON input that declares a pure-null array with a large length field.

In UBJSON, null values do not actually appear in the input; they are only declared.

Because of UBJSON's optimized-container encoding, the attacker therefore only needs to supply a very large length field for a null-typed array, which makes from_ubjson take a very long time to finish.

    // inside get_ubjson_array(): one get_ubjson_value() call per declared element
    for (std::size_t i = 0; i < size_and_type.first; ++i)
    {
        if (JSON_HEDLEY_UNLIKELY(!get_ubjson_value(size_and_type.second)))
        {
            return false;
        }
    }

In practice, the loop in get_ubjson_array should be optimized (e.g. with resize() to speed it up), or a special structure should be used to record the count and position of the nulls instead of pushing them into memory one by one, writing the nulls out only when the value is serialized back to JSON; a toy comparison of the two growth strategies follows.
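
To make the cost difference concrete, here is a toy comparison with a plain std::vector (in the library every element is a full basic_json object, so the real constants are much larger):

    #include <cstddef>
    #include <vector>

    // How the parser effectively grows the array today: one insertion per
    // declared element, with repeated capacity checks and reallocations.
    std::vector<std::nullptr_t> nulls_one_by_one(std::size_t n)
    {
        std::vector<std::nullptr_t> v;
        for (std::size_t i = 0; i < n; ++i)
            v.emplace_back(nullptr);
        return v;
    }

    // The suggested bulk alternative: a single value-initializing resize().
    // Faster, but peak memory is unchanged -- n elements are still stored.
    std::vector<std::nullptr_t> nulls_bulk(std::size_t n)
    {
        std::vector<std::nullptr_t> v;
        v.resize(n);
        return v;
    }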

Which compiler and operating system are you using?

  • Compiler: g++
  • Operating system: ubuntu
$ uname -a
Linux spider 5.4.114-1-pve #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) x86_64 x86_64 x86_64 GNU/Linux
$ g++ --version                                                                   
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Which version of the library did you use?

  • latest release version 3.9.1
  • other release - please state the version: not checked, maybe
  • the develop branch

If you experience a compilation error: can you compile and run the unit tests?

  • yes
  • no - please copy/paste the error message below
@nlohmann (Owner) commented Jun 1, 2021

Do I understand your issue right that the "attack" would be parsing a UBJSON array that is encoded like "an array with 100000 null values", which, due to the compact encoding, indeed requires only a few bytes on the UBJSON side but yields a large overhead when represented as JSON?

I think using resize is unsafe, because it could lead to immediate large memory allocations based on untrusted input.

@nlohmann nlohmann added the aspect: binary formats BSON, CBOR, MessagePack, UBJSON label Jun 1, 2021
@spidermana (Author)

@nlohmann: Do I understand your issue right that the "attack" would be parsing a UBJSON array that is encoded like "an array with 100000 null values", which, due to the compact encoding, indeed requires only a few bytes on the UBJSON side but yields a large overhead when represented as JSON?

You get it.

In reality, there should not be such a large difference between the two formats.
The original parsing logic is not wrong, but this special input makes the original design behave badly, which can easily be exploited by an attacker.

After all, parsing just 10 characters requires over 35 gigabytes of memory usage.

And this is not the worst case: since my server has 64 GB of RAM, I only used "[$Z#l1~~~" as the test case; otherwise the program would be killed before finishing due to lack of memory.

If the input were modified to "[$Z#l~~~~", these 9 bytes could consume over 100 GB of memory.

Possible solutions now might be:

  1. A simple fix: use resize() etc. The current loop pushes the nulls one by one, which is very inefficient; resize() can somewhat improve parsing speed, but the peak memory usage stays the same as with the original loop. So resize() is just a small temporary fix. (not recommended)

[Figure: mprof memory profile of resize() vs. the original loop]

  2. The best way is to implement a better data structure (using a few fields such as is_null and size to "store" the array of nulls) and to modify the relevant functions (e.g. the array-fetching logic) while guaranteeing ABI stability; a hypothetical sketch follows. (more code and time required)
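
A purely hypothetical sketch of what such a structure could look like (none of these names exist in the library, and keeping the public array API and ABI intact around it is exactly the expensive part):

    #include <cstddef>
    #include <ostream>

    // Hypothetical run-length representation of a homogeneous null array:
    // n logical nulls cost O(1) memory instead of n stored values.
    struct null_run_array
    {
        std::size_t size = 0;  // number of logical null elements

        // Materialize the nulls only when serializing back to JSON text.
        void dump(std::ostream& os) const
        {
            os << '[';
            for (std::size_t i = 0; i < size; ++i)
                os << (i == 0 ? "null" : ",null");
            os << ']';
        }
    };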

@nlohmann (Owner) commented Jun 2, 2021

I'm not sure whether adding code for such special cases is a good idea as it adds complexity for very artificial inputs.

@nlohmann nlohmann added state: please discuss please discuss the issue or vote for your favorite option and removed kind: bug labels Jun 2, 2021
@spidermana (Author)

Thanks for your reply!

Obviously, the solution is not limited to the two options I suggested above.

As the creator of this repository, you have a much better understanding of this work than I do, and I am confident that you can find a solution that balances complexity and security.

In addition, I still want to emphasize two points about this issue and hope for your support.

  • As stated above, the logic is indeed exploitable, and the technical barrier to exploiting it is very low.
  • JSON for Modern C++ is a really amazing work with many users. You have highlighted design goals in the README.md, especially high quality with serious testing and memory efficiency. If its maintainers also guarantee its security, this work will certainly win even more users.

Looking forward to your response and a fix.

@nlohmann (Owner) commented Jun 4, 2021

I just tried py-ubjson which shows the same behavior:

    import ubjson
    encoded = b'[$Z#L\x00\x00\x00\x00\xff\xff\xff\xff'
    decoded = ubjson.loadb(encoded)

(Python allocates some 30 GB of memory).

@SunHao-0 commented Jun 4, 2021

@nemequ

@nlohmann: I just tried py-ubjson which shows the same behavior

I don't think we should ignore the problem just because py-ubjson doesn't handle this special case either.
In my opinion, this special case is very interesting and should be handled properly; otherwise, people who use ubjson as their underlying serialization format could face a terrible disaster, as @spidermana mentioned above:

@spidermana: After all, parsing just 10 characters requires over 35 gigabytes of memory usage.

The backend server would be DoS'd due to the huge memory cost and high CPU utilization.

I think the second solution @spidermana mentioned is feasible. We can do better with some kind of marking in the array object; that way, both time and memory consumption can be reduced. However, this raises the ABI-stability problem, and I haven't come up with a solution for it. What do you think about this problem? @nlohmann

@nlohmann (Owner) commented Jun 4, 2021

I would be happy to see a proposal, but it is, in my opinion, not so easy. The library is designed to store arbitrary JSON, and your proposal is to change the storage to accommodate a very specialized use case.

@spidermana (Author) commented Jun 4, 2021

Based on your suggestion, I have also tried py-ubjson.

The test case ('[$Z#L\x00\x00\x00\x00\xff\xff\xff\xff') you provided is actually different from the one I used before ('[$Z#l\x31\x7E\x7E\x7E'). You should compare the two implementations under the same input.

For the '[$Z#L\x00\x00\x00\x00\xff\xff\xff\xff' input, the current JSON for Modern C++ implementation probably uses well over 150 GB of memory and 2000+ seconds of CPU time, but py-ubjson only requires 30 GB of RAM and 30+ seconds of CPU time. Since my server does not have enough RAM, I cannot obtain specific measurements for this test case.

But for the test case '[$Z#l1~~~', i.e. '[$Z#l\x31\x7E\x7E\x7E', the current JSON for Modern C++ implementation uses over 35 GB of memory and nearly 600 s of CPU time (please refer to the picture here), while py-ubjson takes just under 7 s of CPU time and 7 GB of memory.


The worse the input, the greater the difference between py-ubjson and the current implementation of JSON for Modern C++. Perhaps you can find some inspiration in py-ubjson that helps you strike a balance between security and complexity when fixing this problem.

In fact, an attacker will not use arbitrary input, but very specialized input, as I described above.

This issue should not be underestimated. I sincerely hope you can solve this problem and improve the security.

@gregmarr (Contributor)

I haven't done this myself, just speculating: would it be possible for the caller to create their own class derived from json_sax_dom_parser and simply return false from start_array if elements is greater than some predefined maximum size? This would mean they'd have to do all of from_ubjson themselves, as it doesn't have an option to replace the parser; a sketch of the idea follows. If that works, then perhaps the maximum array size could be a configuration parameter of the object or type, or a parameter passed to the read function. I seem to recall a maximum nesting level being implemented at some point due to a stack overflow with a specially crafted JSON file, but I can't find it now, so maybe that was moved into a custom-parser capability, with the depth parameter.
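
A rough sketch of the idea (untested; it uses the public json::json_sax_t interface rather than the internal json_sax_dom_parser, and the class name and cap are made up):

    #include <cstddef>
    #include <string>
    #include <nlohmann/json.hpp>

    // Reject any container whose announced size exceeds a caller-chosen cap;
    // everything else is accepted. This only vets sizes -- it builds no DOM.
    struct size_limited_sax : nlohmann::json::json_sax_t
    {
        std::size_t max_elements;
        explicit size_limited_sax(std::size_t cap) : max_elements(cap) {}

        bool start_array(std::size_t elements) override
        {
            // std::size_t(-1) means the container size was not announced
            return elements == std::size_t(-1) || elements <= max_elements;
        }
        bool start_object(std::size_t elements) override
        {
            return elements == std::size_t(-1) || elements <= max_elements;
        }
        bool null() override { return true; }
        bool boolean(bool) override { return true; }
        bool number_integer(number_integer_t) override { return true; }
        bool number_unsigned(number_unsigned_t) override { return true; }
        bool number_float(number_float_t, const string_t&) override { return true; }
        bool string(string_t&) override { return true; }
        bool binary(binary_t&) override { return true; }
        bool key(string_t&) override { return true; }
        bool end_array() override { return true; }
        bool end_object() override { return true; }
        bool parse_error(std::size_t, const std::string&,
                         const nlohmann::json::exception&) override { return false; }
    };

    // Usage: pre-screen untrusted input before building the DOM.
    //   size_limited_sax guard{100000};
    //   bool ok = nlohmann::json::sax_parse(
    //       payload, &guard, nlohmann::json::input_format_t::ubjson);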

@gregmarr (Contributor)

@spidermana: In practice, the loop in get_ubjson_array should be optimized (e.g. with resize() to speed it up),

Using reserve() would be good, but it would somehow need to preserve the existing behavior of not allocating memory for the array when get_ubjson_value doesn't add anything to it because of a custom parser.

@nickaein (Contributor) commented Aug 5, 2021

A workaround is to provide a low-cost helper/utility function that calculates the effective size of a raw UBJSON input without actually constructing it. A metric as simple as the total number of elements seems easy to implement and is probably more than enough for rejecting such malicious inputs; one possible shape is sketched below.

Depending on the use case, the user can decide to call this function on the given input and put a hard limit on the number of elements as a security measure.

I'm not familiar with the UBJSON format, so it would be very helpful to hear your feedback on the viability of this approach.
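
One possible shape for such a helper (hypothetical names; it reuses the SAX layer so the check never builds a DOM, and a guard handler like the size_limited_sax sketched in an earlier comment rejects a malicious '#l<huge>' header from its announced size alone):

    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <nlohmann/json.hpp>

    // Hypothetical helper: returns true when the UBJSON input stays within
    // 'max_elements'. size_limited_sax is the vetting handler sketched above;
    // any handler that rejects oversized containers would work here.
    bool ubjson_within_limit(const std::vector<std::uint8_t>& input,
                             std::size_t max_elements)
    {
        size_limited_sax guard{max_elements};
        return nlohmann::json::sax_parse(
            input, &guard, nlohmann::json::input_format_t::ubjson);
    }

    // Usage sketch:
    //   if (ubjson_within_limit(payload, 100000))
    //       nlohmann::json j = nlohmann::json::from_ubjson(payload);
    //   else
    //       /* reject the input */;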

@t-b (Contributor) commented Aug 5, 2021

@gregmarr: I seem to recall a maximum nesting level being implemented at some point due to a stack overflow with a specially crafted JSON file, but I can't find it now, so maybe that was moved into a custom-parser capability, with the depth parameter.

This was probably #1788, with a fix in #1436. That fix made parse limited only by available memory.

If I understand this issue correctly, it is about preventing the parsing of input which is small (in terms of memory size) on the attacker's side but large on the json library's side, right?

@spidermana (Author)

@t-b: If I understand this issue correctly, it is about preventing the parsing of input which is small (in terms of memory size) on the attacker's side but large on the json library's side, right?

Your understanding is right.

Besides, @nickaein's solution prevents malicious input and is easy to implement, but it may cost some usability.
The py-ubjson implementation, on the other hand, optimizes memory usage and processing time better, but its overall implementation cost is much higher.

Each has pros and cons.
I hope this gets fixed.

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the state: stale the issue has not been updated in a while and will be closed automatically soon unless it is updated label Jan 9, 2022
fangq added a commit to NeuroJSON/json that referenced this issue Mar 2, 2022
fangq added a commit to NeuroJSON/json that referenced this issue Apr 4, 2022
fangq added a commit to NeuroJSON/json that referenced this issue Apr 23, 2022
fangq added a commit to NeuroJSON/json that referenced this issue Apr 23, 2022
nlohmann pushed a commit that referenced this issue Apr 29, 2022
* support UBJSON-derived Binary JData (BJData) format

* fix Codacy warning

* partially fix VS compilation errors

* fix additional VS errors

* fix more VS compilation errors

* fix additional warnings and errors for clang and msvc

* add more tests to cover the new bjdata types

* add tests for optimized ndarray, improve coverage, fix clang/gcc warnings

* gcc warns about a useless conversion but msvc gives an error

* fix ci_test errors

* complete test coverage, fix ci_test errors

* add half precision error test

* fix 'No newline at end of file' error from clang

* simplify endian condition, format unit-bjdata

* remove broken test due to alloc limit

* full coverage, I hope

* move bjdata new markers from default to the same level as ubjson markers

* fix ci errors, add tests for new bjdata switch structure

* make is_bjdata const after using initializer list

* remove the unwanted assert

* move is_bjdata to an optional param to write_ubjson

* pass use_bjdata via output adapter

* revert order to avoid msvc 2015 unreferenced formal param error

* update BJData Spec V1 Draft-2 URL after spec release

* amalgamate code

* code polishing following @gregmarr's feedback

* make use_bjdata a non-default parameter

* fix ci error, remove unwanted param comment

* encode and decode bjdata ndarray in jdata annotations, enable roundtrip tests

* partially fix ci errors, add tests to improve coverage

* polish patch to remove ci errors

* fix a ndarray dim vector condition

* fix clang tidy error

* add sax test cases for ndarray

* add additional sax event tests

* adjust sax event numbering

* fix sax tests

* ndarray can only be used with array containers, discard if used in object

* complete test coverage

* disable [{SHTFNZ in optimized type due to security risks in #2793 and hampered readability

* fix ci error

* move OutputIsLittleEndian from tparam to param to replace use_bjdata

* fix ci clang gcc error

* fix ci static analysis error

* update json_test_data to 3.1.0, enable file-based bjdata unit tests

* fix stack overflow error on msvc 2019 and 2022

* use https link, update sax_parse_error after rebase

* make input_format const and use initializer

* return bool for write_bjdata_ndarray

* test write_bjdata_ndarray return value as boolean

* fix ci error