I'm trying to understand how LTE and UMTS/HSPA+ transition away from the full-power state, and what kind of resources are tied up at the tower while the radios of a mobile device operate at full power.
I've looked at Ch. 7 and 8 of High Performance Browser Networking, and also at an AT&T Research article mentioned in the book, but they don't address this question directly.
Is the timeout for the transition from full power to half power in UMTS (DCH to FACH) almost always 5 s, or where does the 5 s value (mentioned in the AT&T Research article above) come from?
Is the UMTS timeout for the transition away from the full-power DCH state reset when minor traffic is sent before the timeout expires, or does that depend on whether such minor traffic would be deemed small enough to be handled afterwards on the shared low-speed channel in the low-bandwidth, half-power FACH state?
What's the timeout in LTE for the transition away from the full-power state?
What resources are tied up at the tower in the UMTS and LTE full-power states, and with what implications for the carrier?
How much is the transition away from the full-power states dictated by the battery-consumption concerns of the mobile device, as opposed to actual conservation of tower resources by the carrier? For example, if the device is connected to a charger, would it ever be allowed, or make sense, to keep the radio of the mobile device in the full-power state with UMTS and LTE?
It's a strange claim that the RRC transition from DCH to FACH takes 5 s; it is usually faster than that. The longer the connection lingers, the more RRC resources the network ties to your RRC instance, so it is in the interest of good design to keep that hang time as short as possible, since it saves computing and spectral resources.
So, going back to your main question: the CELL_DCH state consumes the most power, CELL_FACH consumes power in bursts, and IDLE consumes the least. The bursts come from cell reselections and RRC connection establishment requests.
Here is a proper RRC state diagram (http://images.books24x7.com/bookimages/id_6399/fig209_01.jpg)
Here is an RRC state power consumption graph I found via Google Images (http://3.bp.blogspot.com/-NoMR5oNLbCs/T3H1i0bsdgI/AAAAAAAAAW0/pv0G-tG0auk/s1600/Power+Consumption+Vs+RRC+states.png)
Now, if the article's data is correct, what I can deduce is that the measurements were taken while the RRC state machine in the UE was in "hysteresis", i.e. it took the UE 5 s to decide on the next RRC state. That may then be a network design or degradation issue.
UMTS:
The delay for transitioning from DCH to FACH is governed by a timer called T1, which in this case the network has configured to 5 seconds. Whatever the value is, it is a compromise between device battery consumption and the signalling load between the network elements.
For mobile applications that exchange small packets periodically but infrequently, a long timer keeps the device in the high-power state for many additional and unnecessary seconds, draining the battery.
Prior to 3GPP Release 8 this issue was addressed by device manufacturers: instead of waiting for the network-initiated transition to FACH, the device would send a Signalling Connection Release Indication (SCRI) once it was done sending and receiving data. This placed the device in idle mode, the lowest power consumption state.
However, this solution had a downside: the SCRI caused unnecessary signalling load between the network elements when the RAB was released and set up again frequently.
Release 8 addressed this by adding a specific cause value ("UE requested PS data session end") to the SCRI that explicitly tells the network the device is done sending and receiving data. This lets the network distinguish between the different reasons for releasing the connection, and if releases happen too frequently it can deny the request and avoid the signalling load.
See Fast Dormancy Best Practices by GSMA for more info.
LTE:
LTE is simpler, as there are only two RRC states: connected and idle. The timers are still controlled by the network, but remaining in RRC connected state is not as harmful to the UE in LTE, because discontinuous reception (DRX) keeps power consumption lower. Transitioning between the two states also causes much less signalling load in LTE, as this was a design goal.
The DCH state always has the highest power consumption.
The FACH state is lower than DCH, and the URA state is the lowest of them all.
Fast Dormancy helps switch the UE from the highest to the lowest power state.
In connected mode, the power ordering is DCH > FACH > URA.
We are experiencing a very difficult-to-observe problem with our Flink job.
The job is reasonably simple; it:
Reads messages from Kinesis using the Flink Kinesis connector
Keys the messages and distributes them to ~30 different CEP operators, plus a couple of custom WindowFunctions
The messages emitted from the CEP/Window operators are forwarded to a SinkFunction that writes them to SQS
We are running Flink 1.10.1 on Fargate, using 2 containers with 4 vCPUs/8 GB, and the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true
state.backend.rocksdb.files.open: 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 seconds. Over time, the checkpoint sizes increase but the times remain a very reasonable couple of seconds.
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason.
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
Checkpoint size (when it does complete) is around 60MB
CPU usage is high, but not 100% (usually around 60-80%)
Looking at in-progress checkpoints, usually 95%+ of operators complete the checkpoint within 30 seconds, but a handful will just stick and never complete. The SQS sink is always among them, even though the SinkFunction is not rich and has no state.
The backpressure monitor reports HIGH backpressure on these operators.
Eventually this situation resolves itself in one of two ways:
Enough checkpoints fail that the job is failed because the tolerable checkpoint-failure threshold is exceeded
The checkpoints eventually start succeeding, but never go back down to the 5-10 s they take initially (when the state size is more like 30 MB vs. 60 MB)
We are really at a loss as to how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low; we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
Thanks,
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing, and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
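For reference, a minimal sketch of enabling state TTL with the RocksDB compaction filter, using the API as of Flink 1.10. The one-hour TTL, the state name "perKeyState" and the 1000-entry query interval are placeholder values, not taken from the job above; adjust them to your expiry requirements.
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

class TtlExample {
    static ValueStateDescriptor<String> buildDescriptor() {
        // Placeholder TTL of 1 hour; expired entries are dropped during RocksDB compaction.
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)             // TTL refreshed on writes
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .cleanupInRocksdbCompactFilter(1000)  // re-read the current timestamp every 1000 processed entries
                .build();

        // Hypothetical state name; TTL has to be enabled on every descriptor you want cleaned up.
        ValueStateDescriptor<String> descriptor =
                new ValueStateDescriptor<>("perKeyState", String.class);
        descriptor.enableTimeToLive(ttlConfig);
        return descriptor;
    }
}
Note that TTL configured this way only applies to state you register in your own functions; state kept internally by the CEP operators is managed by the CEP library itself, so long-lived patterns still need something like a within() window to bound them.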
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
blocking i/o is being done in a user function
a large number of timers are firing simultaneously
event time skew between different sources is causing large amounts of state to be buffered
data skew (a hot key) is overwhelming one subtask or slot
lengthy GC pauses
contention for critical resources (e.g., using a NAS as the local disk for RocksDB)
Enabling RocksDB native metrics might provide some insight.
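For example, a few of the native metrics can be switched on alongside the configuration shown in the question (option names as exposed by the RocksDB state backend; please verify them against the documentation for your Flink version):
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.estimate-pending-compaction-bytes: true
state.backend.rocksdb.metrics.actual-delayed-write-rate: true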
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
If you do not set this property, it defaults to 1.
Link: https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBOptions.java#L62
I have a gen_server process that maintains a pool. For each incoming request, I need to examine the pool to see if there is a match for it: if there is one, the matched request is removed from the pool and replies are sent to both requests; if there is none, the new request is put into the pool for later examination.
The business logic requires that if a request R sits in the pool for T seconds without being matched, I must reply to R with something like "I cannot find a match for you".
Ideally, I want to do this with timers: for each incoming request, if there is no match, put it into the pool as before, but also start a timer that tells the gen_server to remove it when time is up; of course, if the request is matched later, the timer should be cancelled.
My concern is that if there are lots of unmatched requests in the pool, there will be lots of running timers. Will this (too many timers) become a problem?
Big improvements were made to the timer implementation in R18:
Besides the API changes and time warp modes a lot of scalability and performance improvements regarding time management has been made internally in the runtime system. Examples of such improvements are scheduler specific timer wheels, scheduler specific BIF timer management, parallel retrieval of monotonic time and system time on systems with primitives that are not buggy.
Scheduler-specific timer wheels are exactly what is interesting in your scenario. I doubt you would find a better-performing solution to your problem in Erlang or any other language/environment, so your solution should be fine as long as you are using R18 or newer.
The application, used by a group of 100+ users, was made with VB6 and RDO. A replacement is coming, but the old one is still maintained. Users moved to a different building across the street and problems began. My opinion has been that the problem is bandwidth, but I've had to argue with others who say it's the database. Users regularly experience network slowness in the application, but also in workstation tasks in general. The application occasionally moves and indexes large audio files, among other things.
Occasionally the database becomes hung. We have many top-end, robust SQL Servers, so it is not a server problem. What I figured out is that a transaction is begun on a connection but fails to complete properly because of a communication error. Updates from other connections become blocked, they keep stacking up, and users are down for half a day. What I've begun doing the moment I'm told of a problem, after verifying the database is hung, is to set the database to single user and then back to multiuser to clear the connections. Everyone must then restart their applications.
Today I found out there is a bandwidth limit at their new location which they regularly max out. I think in the old location there was a big pipe servicing many people, but now they are on a small pipe servicing a small number of people, which is also less tolerant of momentary spikes in bandwidth demand.
What I want to know is exactly what happens to packets, both incoming and outgoing, when the bandwidth limit is reached, and what happens to SQL Server communication in particular. Do some packets get dropped? Do they start arriving out of sequence more often? Do timing problems occur?
I plan to start controlling things such as file moves through the application. But I also want to know what configurations are usually present on network nodes to deal with transient high demand.
This is a very broad question. Networking is key to good performance (especially with Availability Groups or any sort of mirroring setup). When transactions complete on the SQL Server, the results are placed in its output buffer. The app then needs to pick up that data, clear the output buffer, and continue on. I suspect (without knowing your configuration) that your apps aren't able to complete that round trip because the network pipe is saturated with requests, so they can't get what they need to finish and close out. This causes havoc, since the network can't keep up with what the apps and the SQL Server are trying to do. Then you have a 200-car pileup on a one-lane highway.
Hindsight being what it is, there should have been extensive testing of the network capacity before everyone moved across the street. Clearly that didn't happen, so you are left to do what you can with what you have. If the company can't provide a stable network connection, the situation may be out of your control. If you're the DBA, I highly recommend you speak to your higher-ups and explain the consequences of the reduced network capacity. Often, showing the consequences of inaction can lead to action.
Out of curiosity, is there any way you can analyze what waits are happening when the pileup occurs? I suspect it will be something along the lines of ASYNC_NETWORK_IO, which usually indicates that SQL Server is waiting on the app to come back and pick up its data.
I'm currently in the process of rewriting my Java code to run on Google App Engine. Since I cannot use Timer to implement timeouts (no thread creation is allowed), I need to rely on the system clock to mark when the timeout started, so that I can compare against it later to find out whether the timeout has occurred.
Now, several people (even some on Google's payroll) have advised developers not to rely on system time, due to the distributed nature of Google's app servers and the inability to keep their clocks in sync. Some say the deviation between system clocks can be 10 s or even more.
A deviation of 1 s would be very good for my app, 2 seconds would be tolerable, anything higher would cause a lot of grief for me and my users, and a 10-second difference would make my app effectively unusable.
I don't know whether anything has changed for the better since then (I hope so), but if not, what are my options, other than firing off a separate request whose handler sleeps for the duration of the timeout (which cannot exceed 30 seconds due to the request timeout limitation) in order to keep the timeout duration consistent?
Thanks!
More Specifically:
I'm trying to develop a poker game server. For those who are not familiar with how online poker works: I have a set of players attached to one game instance. Every player has a certain amount of time to act before the timeout occurs and the next player can act. There is a countdown on each actor and every client has to see it. Only one player can act at a time. The timeout durations I need are 10 s and 20 s for now.
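A minimal sketch of the clock-based approach described above (names and values are illustrative, not from an actual implementation): the turn deadline is computed from the local clock and persisted with the game state, and later requests, possibly handled by a different instance, compare against it.
public class TurnDeadline {
    // Compute an absolute deadline from the local clock when the turn starts.
    public static long deadlineFor(long timeoutMillis) {
        return System.currentTimeMillis() + timeoutMillis;   // e.g. timeoutMillis = 10000 or 20000
    }

    // On a later request (possibly served by a different instance with a different clock),
    // decide whether the player ran out of time.
    public static boolean timedOut(long deadlineMillis) {
        return System.currentTimeMillis() >= deadlineMillis;
    }
}
If the instance evaluating timedOut() has a clock N seconds ahead of or behind the one that computed the deadline, the effective timeout is off by exactly N seconds, which is why the clock-skew question matters here.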
You should never be making your request handlers sleep or wait. App Engine will only automatically scale your app if request handlers complete in an average of 1000ms or less; deliberately waiting will ruin that. There's invariably a better option than sleeping/waiting - let us know what you're doing, and perhaps we can suggest one.
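One option that is often suggested for this kind of per-turn timeout on App Engine (my assumption, not something the answer above spells out) is to enqueue a task with a countdown equal to the timeout instead of sleeping in a handler; roughly:
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class TurnTimeoutScheduler {
    // Ask the task queue to call us back after the turn's timeout. The handler behind the
    // hypothetical /tasks/turn-timeout URL should check that the turn is still open before
    // acting, since the player may have moved in the meantime.
    public static void schedule(String gameId, String turnId, long timeoutMillis) {
        QueueFactory.getDefaultQueue().add(
                TaskOptions.Builder
                        .withUrl("/tasks/turn-timeout")
                        .param("gameId", gameId)
                        .param("turnId", turnId)
                        .countdownMillis(timeoutMillis));    // 10000 or 20000 per the question
    }
}
This keeps every request handler short while still producing a server-side timeout that does not depend on a single instance staying alive or sleeping.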
I am currently working on an n-tier system and battling some database performance issues.
One area we have been investigating is the latency between the database server and the application server. In our test environment the average ping time between the two boxes is in the region of 0.2 ms, whereas on the client's site it is more like 8.2 ms. Is that something we should be worried about?
For your average system, what do you consider a reasonable latency, and how would you go about testing/measuring it?
Yes, network latency (measured by ping) can make a huge difference.
If your database responds in 0.001 ms, then you will see a huge impact going from a 0.2 ms to an 8 ms ping. I've heard that database protocols are chatty, which, if true, means they are affected more by slow network latency than, say, HTTP.
And more than likely, if you are running one query, then adding 8 ms to get the reply from the DB is not going to matter. But if you are doing 10,000 queries, which generally happens with bad code or non-optimized use of an ORM, you will wait an extra ~80 seconds of pure latency at an 8 ms ping, versus only ~2 seconds at 0.2 ms.
As a matter of policy, I never let client applications contact the database directly. I require that client applications always go through an application server (e.g. a REST web service). That way, if I accidentally have a "1+N" ORM issue, it is not nearly as impactful. I would still try to fix the underlying problem...
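To make the "1+N" point concrete, here is a hedged sketch (the table and column names are invented for illustration) contrasting one round trip per row with a single set-based query; at 8 ms per round trip the first version pays that latency once per id.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

class ChattyVsBatched {
    // "1+N" style: one network round trip per id, so total time is roughly N * (ping + query time).
    static void loadOneByOne(Connection conn, List<Long> ids) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT name FROM customers WHERE id = ?")) {
            for (Long id : ids) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rs.getString("name");   // use the value here
                    }
                }
            }
        }
    }

    // Set-based: a single round trip regardless of how many rows come back.
    static void loadAllAtOnce(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM customers")) {
            while (rs.next()) {
                rs.getLong("id");
                rs.getString("name");   // use the values here
            }
        }
    }
}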
In short: no!
What you should monitor is the overall performance of your queries (i.e. transport to the DB + execution + transport back to your server).
What you could do is use a performance counter to monitor the time your queries usually take to execute.
You'll probably see that your results are above the millisecond range anyway.
There's no such thing as "reasonable latency" in general. You should rather think about the "reasonable latency for your project", which varies a lot depending on what you're working on.
People don't have the same expectations for a real-time trading platform as for a read-only amateur website.
On a Linux-based server you can test the effect of latency yourself by using the tc command.
For example, this command will add a 10 ms delay to all packets going via eth0:
tc qdisc add dev eth0 root netem delay 10ms
Use this command to remove the delay:
tc qdisc del dev eth0 root
More details available here:
http://devresources.linux-foundation.org/shemminger/netem/example.html
All applications will differ, but I have definitely seen situations where 10ms latency has had a significant impact on the performance of the system.
One of the head honchos at answers.com said that, according to their studies, a 400 ms wait for a web page load is about the point at which people first start cancelling the page load and going elsewhere. My advice is to look at the whole process, from the original client request to fulfillment, and if you're doing well there, there's no need to optimize further. 8.2 ms vs 0.2 ms is a roughly 40-fold difference in a mathematical sense, but in human terms no one can really perceive an 8.0 ms difference. It's why they have photo finishes in races ;)