Bugzilla – Bug 695
DcfManager::UpdateBackoff () uses slow HighPrecision::Div()
Last modified: 2009-11-30 09:39:58 UTC
UpdateBackoff() function seems to be a bottleneck for wifi-based simulations eating up to 30% in "examples/mesh". The really slow line is (src/devices/wifi/dcf-manager.cc:522 at rev 684768c10c59): Scalar nSlots = (Simulator::Now () - backoffStart) / m_slotTime; This line executes slow HighPrecision division each time the Updatebackoff() function is called (and moreover each for() loop iteration). I see at least 2 possible solutions: 1) retrieve nanoseconds from all involved Time variables and run all calculations using long double precision; 2) Calculate (1 / m_slotTime) once in DcfManager::SetSlot() and use faster 128bit multiplication instead of slow 128bit devision here. Solution #1 theoretically can be less precise than original code, but: 1) GetNanoSeconds () returns uint64_t which won't overflow until 584-year-long simulation which is more than enough; 2) difference (Simulator::Now () - backoffStart) isn't supposed to be very large, so long double easily handles it; 3) ratio ((Simulator::Now () - backoffStart) / m_slotTime) isn't supposed to be very large too, so again long double is enough; 4) nSlots is anyway converted to double, thus losing the HighPrecision. So I think solution #1 isn't less precise in practice. Solution #2 gives less runtime speed up than solution #1, but is still as precise as original code. All unit tests and regression tests are passed for both solutions. Personally I vote for solution #1. More detailed benchmarks and patches will be posted soon.
(In reply to comment #0) > More detailed benchmarks and patches will be posted soon. Static optimized build compiled with gcc-4.3.2 run on Turion 64 X2 TL-60 @ 2Ghz. I've slightly modified the following examples to make them run longer (patch attached). 0) Original code (revision c1bd4ffb5e47) Command being timed: "./build/optimized/examples/mesh" User time (seconds): 182.09 System time (seconds): 0.04 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 3:02.24 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4" User time (seconds): 280.16 System time (seconds): 2.41 Percent of CPU this job got: 98% Elapsed (wall clock) time (h:mm:ss or m:ss): 4:46.7 1) Solution #1 Command being timed: "build/optimized/examples/mesh" User time (seconds): 120.45 System time (seconds): 0.09 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 2:01.58 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4" User time (seconds): 240.95 System time (seconds): 2.87 Percent of CPU this job got: 94% Elapsed (wall clock) time (h:mm:ss or m:ss): 4:18.66 2) Solution #2 Command being timed: "build/optimized/examples/mesh" User time (seconds): 122.12 System time (seconds): 0.08 Percent of CPU this job got: 97% Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.70 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4" User time (seconds): 243.98 System time (seconds): 3.34 Percent of CPU this job got: 97% Elapsed (wall clock) time (h:mm:ss or m:ss): 4:14.37 So, we're getting a speed up of about 15-30%.
Created attachment 606 [details] Solution #1 (via long doubles)
Created attachment 607 [details] Solution #2 (via TimeInvert)
Created attachment 608 [details] Modifications to examples to measure perfomance gain.
(In reply to comment #0) > UpdateBackoff() function seems to be a bottleneck for wifi-based simulations > eating up to 30% in "examples/mesh". Mathieu, could you explain me why do we need 128 bits for time at all? We always find high precision arithmetics on top of the profiles (oprofile, cachegrind) -- can we try to avoid them using "native" 64 bit time? More globally, I have an impression that ns-3 is too slow to practically simulate O(100) wifi stations with dense traffic -- do you agree? > The really slow line is (src/devices/wifi/dcf-manager.cc:522 at rev > 684768c10c59): > Scalar nSlots = (Simulator::Now () - backoffStart) / m_slotTime; > > This line executes slow HighPrecision division each time the Updatebackoff() > function is called (and moreover each for() loop iteration). > > I see at least 2 possible solutions: > 1) retrieve nanoseconds from all involved Time variables and run all > calculations using long double precision; > 2) Calculate (1 / m_slotTime) once in DcfManager::SetSlot() and use faster > 128bit multiplication instead of slow 128bit devision here. > > Solution #1 theoretically can be less precise than original code, but: > 1) GetNanoSeconds () returns uint64_t which won't overflow until 584-year-long > simulation which is more than enough; > 2) difference (Simulator::Now () - backoffStart) isn't supposed to be very > large, so long double easily handles it; > 3) ratio ((Simulator::Now () - backoffStart) / m_slotTime) isn't supposed to be > very large too, so again long double is enough; > 4) nSlots is anyway converted to double, thus losing the HighPrecision. > So I think solution #1 isn't less precise in practice. > > Solution #2 gives less runtime speed up than solution #1, but is still as > precise as original code. > > All unit tests and regression tests are passed for both solutions. > Personally I vote for solution #1. > > More detailed benchmarks and patches will be posted soon. >
Created attachment 672 [details] solution #3 How about this 3rd solution ?
ns-3-dev: bash-3.2$ time -p ./build/optimized/examples/wireless/wifi-wired-bridging --nStas=4 real 278.54 user 249.31 sys 3.84 bash-3.2$ time -p ./build/optimized/examples/mesh/mesh real 216.39 user 206.39 sys 0.02 ns-3-dev + solution #3: bash-3.2$ time -p ./build/optimized/examples/wireless/wifi-wired-bridging --nStas=4 real 226.54 user 209.00 sys 3.03 bash-3.2$ time -p ./build/optimized/examples/mesh/mesh real 63.63 user 62.24 sys 0.19
(In reply to comment #5) > Mathieu, could you explain me why do we need 128 bits for time at all? We That goes back to a long discussion a long long time ago. To make it short: - we want to make it easy for model developers to get the same simulation results on different platforms with different floating point arithmetic units - hence, we want to encourage avoiding the use of floating point arithmetic - but users still need to make simple calculations which need more precision than simple integer arithmetic - our internal time is 64bits - hence, user calculations need more precision than 64 bits - hence, the easiest thing to do is to give them 64.64 (128) precision. > always find high precision arithmetics on top of the profiles (oprofile, > cachegrind) -- can we try to avoid them using "native" 64 bit time? More In some models, (say, DcfManager), we use Time because it's convenient, not because we need to. i.e., a lot of the manager code could work with simple integer arithmetics. In fact, it was originally written that way. > globally, I have an impression that ns-3 is too slow to practically simulate > O(100) wifi stations with dense traffic -- do you agree? I suspect that if you try to make 100 stations _interfere_, then, yes, you will run into interesting problems but mainly because the underlying interference PHY model is O(n2) and, hopefully, you can easily fix this by using a different PHY model. Otherwise, I would be happy to help fix any optimization issue you have with the wifi models to make them useful to you.
(In reply to comment #6) > Created an attachment (id=672) [details] > solution #3 > > How about this 3rd solution ? Good for me. It gets much better speed gain than others!
can you review it for correctness and push if it is ok ? (all tests appear to pass, but, who knows...)
(In reply to comment #10) > can you review it for correctness and push if it is ok ? (all tests appear to > pass, but, who knows...) I'll try to compare some traces for several (relatively) long-running simulations and then push.
(In reply to comment #10) > can you review it for correctness and push if it is ok ? (all tests appear to > pass, but, who knows...) I've noticed, that your patch and original code will behave differently in several situations, because you've replaced lrint() with integer devision, which works more like floor(). Old code: Simulator::Now() = 27000; backoffStart = 10000; m_slotTime = 9000; nSlots = (Simulator::Now () - backoffStart) / m_slotTime = 1.88888; nIntSlots = lrint (nSlots.GetDouble ()) = 2; New code: Simulator::Now().GetMicroSeconds() = 27; backoffStart.GetMicroSeconds() = 10; m_slotTimeUs = 9; nus = 27 - 10 = 17; nIntSlots = nus / m_slotTimeUs = 17/9 = 1; I don't know, which of the answers is correct according to the standard.
oh, uhm. I suspect that the correct answer is either round to upper always or round to lower integer always, but never round to the closest. I have no idea what we should be doing here though.
Mathieu, thank you for explanations about history of Time. > I suspect that if you try to make 100 stations _interfere_, then, yes, you will > run into interesting problems but mainly because the underlying interference > PHY model is O(n2) and, Yes, now I know this too. Typical profile of large scale wifi simulation I see: samples % image name symbol name 2361908309 68.2932 mesh-sc ns3::InterferenceHelper::GetEnergyDuration(double) 185529257 5.3645 mesh-sc ns3::InterferenceHelper::CalculateNoiseInterferenceW(ns3::Ptr<ns3::InterferenceHelper::Event>, std::vector<ns3::InterferenceHelper::NiChange, std::allocator<ns3::InterferenceHelper::NiChange> >*) const 91399806 2.6428 mesh-sc ns3::DcfManager::GetBackoffStartFor(ns3::DcfState*) 88964737 2.5724 mesh-sc ns3::DcfManager::GetAccessGrantStart() const > hopefully, you can easily fix this by using a different > PHY model. Do you have one? We will try to optimize interference model when we'll have some time (= students). > Otherwise, I would be happy to help fix any optimization issue you > have with the wifi models to make them useful to you. Thank you! We will definitely return to wifi runtime optimizations.
(In reply to comment #13) > oh, uhm. I suspect that the correct answer is either round to upper always or > round to lower integer always, but never round to the closest. > > I have no idea what we should be doing here though. I think, I would go for rounding to upper integer always to avoid situations with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure, that this is bad): Simulator::Now().GetMicroSeconds() = 17; backoffStart.GetMicroSeconds() = 10; m_slotTimeUs = 9; nus = 17 - 10 = 7; nIntSlots = nus / m_slotTimeUs = 7/9 = 0;
(In reply to comment #14) > > hopefully, you can easily fix this by using a different > > PHY model. > > Do you have one? We will try to optimize interference model when we'll have > some time (= students). No, I don't have one but I wish I had time to add one.
(In reply to comment #15) > (In reply to comment #13) > > oh, uhm. I suspect that the correct answer is either round to upper always or > > round to lower integer always, but never round to the closest. > > > > I have no idea what we should be doing here though. > I think, I would go for rounding to upper integer always to avoid situations > with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure, > that this is bad): > Simulator::Now().GetMicroSeconds() = 17; > backoffStart.GetMicroSeconds() = 10; > m_slotTimeUs = 9; > nus = 17 - 10 = 7; > nIntSlots = nus / m_slotTimeUs = 7/9 = 0; I would support whatever you feel is appropriate if you have tested it. So, feel free to commit a version which works for you.
(In reply to comment #15) > (In reply to comment #13) > > oh, uhm. I suspect that the correct answer is either round to upper always or > > round to lower integer always, but never round to the closest. > > > > I have no idea what we should be doing here though. > I think, I would go for rounding to upper integer always to avoid situations > with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure, > that this is bad): > Simulator::Now().GetMicroSeconds() = 17; > backoffStart.GetMicroSeconds() = 10; > m_slotTimeUs = 9; > nus = 17 - 10 = 7; > nIntSlots = nus / m_slotTimeUs = 7/9 = 0; Seems, that I misunderstood the purpose of UpdateBackoff() function. As I was explained to, this function removes already passed slots from backoff. So it's a normal situation, when it removes no slots. Now we're going to push Mathieu's patch and update regression traces (for now, only third example fails regression test).
(In reply to comment #17) > I would support whatever you feel is appropriate if you have tested it. So, > feel free to commit a version which works for you. changeset d3fdc15b065f