Get Complete Project Material File(s) Now! »
Mobile Data Traffic Prediction
To what degree is the Internet traffic predictable? It is a question that has led to a number of attractive issues and has been continuously investigated since the invention of the Internet [50]. In this section, we review the state-of-the-art on the prediction of mobile data traffic. Our discussion is organized from two perspectives:
Aggregated mobile data traffic. In this perspective, we consider the mobile data traffic from the viewpoint of a mobile network operator. Such data traffic is aggregated over many mobile devices within the same cell, the same close geographical area, or the same service/application.
Individual mobile data traffic. Here we discuss in an individual viewpoint, i.e., the mobile data traffic that is generated by a single mobile device.
For each perspective, we briefly introduce the data traffic characterization and particularly present the practical prediction techniques. It is worth noting that, in this section, we focus on the studies on the Internet traffic and exclude those on other traffic (e.g., voice calls).
Literature on Aggregated Mobile Data Traffic
The investigation on aggregated mobile data traffic is mainly driven by the analyses on world-wide large-scale operator-collected datasets. For instance, such datasets that have nationwide populations are deeply mined in the relevant studies by Paul et al. [27] (USA), Hoteit et al. [51] (France), and Xu et al. [37, 38] (China).
Characterization
There are two major aspects with respect to the characterization, i.e., temporal dynamics and spatiotemporal correlation.
There is a general agreement on the regularity of the temporal variation of aggregated mobile data traffic [23]. Almost at the same time, Paul et al. [27] and Shafiq et al. [33] separately investigate the temporal evolution of aggregated mobile data traffic of cell towers and popular applications. They both find that such traffic follows a daily repetitive pattern over weekdays: in general, the traffic has low demand during nighttime and high demand during daytime. The same repetitive pattern is also observed by Xu et al. [37, 38]. Is is also remarked in [27, 33, 52] that the traffic over weekdays and weekends have different repetitive patterns and demands; a larger data traffic demand exists on weekdays than weekends. An interesting fact is that the temporal variations observed by Paul et al. [27] and Shafiq et al. [33] have different peak hours, which is also observed from other network traffic [23]. For this, a possible explanation is that such temporal variation under a higher temporal resolution partially depends on the area of study.
The spatiotemporal correlation exists among the data traffic generated by cell towers over many users in the same area. In the pure spatial perspective, the distribution of the data traffic is spatially heterogeneous: it varies over different regions as revealed by Paul et al. [27] and Xu et al. [37, 38]. Further, the latter authors find that the cell towers have similar data traffic profiles regarding their regions (i.e., resident, transport, office, and entertainment) and such profiles of adjacent cell towers are correlated. In the spatiotemporal perspective, the two papers show that the spatial heterogeneity above also varies over time: pick hours depend on regions. The former authors leverage a quantitive measure (i.e., the Moran’s I statistic) to evaluate spatiotemporal diversity of the data traffic. They find that in general, the imminent loads of adjacent cell towers are more correlated when these loads are high, but the correlation is relatively weak and almost disappears around midnights. Recently, [53] further investigates the spatiotemporal correlation and propose an approach to infer the hidden spatial and temporal structures of aggregated mobile data traffic.
Also, several studies reveal the spatial heterogeneity aggregated over applications. The earlier work by Trestian et al. [40] already shows that the Internet traffic over services and applications is consumed differently at home and work locations. Hoteit et al. [51] find that the data traffic loads of cell towers have different inner diversities among TCP- and UDP-based services. Later, the extended analysis by Shafiq et al. [39] finds that the data traffic aggregated by popular applications is strongly heterogenous over regions. This provides the capability of categorizing cell towers into four classes (web browsing, email, audio, and mixed traffic) with respect to the major applications in their data traffic loads.
Prediction
Some efforts have been put on the prediction of aggregated mobile data traffic. They aim at converting the observed dynamics and correlations above to practical prediction techniques. In the following, we review the proposed prediction techniques according to the level of the aggregation. Cell-level data traffic. There is a common observation on the fact that the data traffic of cell towers has a high degree of both theoretical and practical predictabilities. Regarding the theoretical viewpoint, Zhang et al. [31, 32] investigate the limits of the theoretical predictability by observing the traffic of 7; 000 cell towers in China. They find that under the temporal resolution of 30 minutes, aggregated traffic (voice, text, and data) can be well predicted from the historical demand of the preceding 15 hours; the theoretical predictability of the data traffic is lower than that of the data flow of voice calls or text messages. They also find that the knowledge of the traffic demands of adjacent cells towers can enhance the theoretical predictability, but in a less degree on the data traffic than the others, which supports the quantitative evaluation on the spatiotemporal correlation by Paul et al. [27]. Their results ensure the capability of time series prediction techniques on the prediction of such traffic.
Regarding the practical prediction techniques, Xu et al. [37, 38] show that the cell-level data traffic is predictable via a linear combination of four primary components corresponding to human activities. Zang et al. [54] propose a mixed machine learning approach composed of K-means clustering, Elman Neural Network, and wavelet decom-position. An alternative prediction approach is proposed by Yi et al. [55]; it builds a complex network among cell towers, measures the traffic on the very important ones, and predicts the others’ traffic using Support Vector Regression – another machine learn-ing method. It can recover the whole picture of the traffic demand from only 8% of the total cell towers. In the opposite viewpoint, Nika et al. [36] perform an empirical study on data hotspots using a large-scale operator-collected dataset of 5; 327 cell towers, and show the availability of standard machine learning methods on the prediction of future hotspots (cells towers) of the traffic demand from the past history.
Application-level data traffic. The early paper by Keralapura et al.[56] proposes a technique to cluster users and their browsing profiles. The authors find that user behav-ior in terms of Internet surfing can be captured using a small number of clusters. Such heterogeneity of aggregated mobile data traffic is also investigated by Ying et al. [57]. Later, Shafiq et al. [33] uses a Zipf-like model to capture the distribution of application-level mobile data traffic and finds that the regularity makes the temporal variation of the traffic highly predictable from the history of the past demand using a simple Markovian method. Recently, Zhang et al. [58] design a mixed application-level traffic prediction framework that leverages the -stable modeled property and dictionary learning to sep-arately deal with the temporal variation and the spatial sparsity of the traffic. Marquez et al. [59] extend the analysis in [33] and reveal a strong heterogeneity in difference mobile service demands using correlation and clustering. They show that the temporal usage patterns are quite different from service to service. Besides, several works focus on the traffic generated by special services, such as chatting (e.g., WhatsApp [60] and WeChat [61]), video streaming [62], and mobile cloud [63].
In summary, the proposed techniques extend the technical bound on the prediction of mobile data traffic: they not only leverage the legacy tools that used for analyzing wired network traffic (e.g., the entropy, Markov property, -stable modeled property) to capture the temporal variation but also import several state-of-the-art machine learning tools to utilize the spatiotemporal correlation.
Literature on Individual Mobile Data Traffic
A relatively small body of literature is on the investigation of individual mobile data traffic, which is also driven by data mining. Differently, the relevant studies utilize both large-scale operator-collected datasets, e.g., by Paul et al. [27] and Oliveira et al. [34, 52], and small-scale mobile crowdsensing datasets, e.g., by Jo et al. [35].
Characterization
The characterization from the individual viewpoint is performed by Paul et al. [27], Jo et al. [35], Li et al. [42], Oliveira et al. [34, 52], among others.
There is a general agreement on the heterogeneity of the data traffic, with respect to the user population and the time. It is shared by Paul et al. [27] and Oliveira et al. [34, 52]. They show that most of the total data traffic is generated from a small group of « heavy » users.
Regarding the temporal variation, both the authors above find that, in general, each user is highly active only in a few hours per day, and similarly, the temporal variation is different on weekdays and weekends, as in aggregate mobile data traffic. The latter authors [34, 52] find that individual mobile data traffic also follows daily repetitive patterns and the users also have peak and non-peak hours in terms of data traffic. In particular, they find that the variation of different hours within the same day is stronger than that of the same hours overs different days.
As to the spatiotemporal correlation, Paul et al. [27] point out that a user is usually active at only a few of his common locations. Jo et al. [35] mine a small dataset of locations and services of 124 users over 16 months and they identify the spatiotemporal correlations of service usage patterns.
Other dynamics with respect to social features are also revealed. For instance, Oliveira et al. [34, 52] find that the distribution of individual mobile data traffic is slightly heterogeneous over the age and gender; Li et al. [42] focus on the major smartphone operating systems and discuss the traffic dynamics and major application in each system.
Prediction
Yet, fewer studies have addressed the prediction of individual mobile data traffic. Regarding the bandwidth of mobile devices around a cell tower, a theoretical analysis is performed by Bui et al. [43, 44]. Based on a theoretical LTE radio model, the authors propose a model to predict the bandwidth of mobile devices over a wide range of time scales [43]. Their model considers both the user location and the statistic of bandwidth availability. They also design a refined model aiming at the prediction of short-term bandwidth using Gaussian Random Walks [44]. Regarding the latency of each data session, Zhao et al. [11, 45] address the static and dynamic latency estimation problems and propose a distance-feature decomposition algorithm based on the Matrix Factorization technique to predict the latency.
In summary, the current literature has already shown the temporal dynamics and spatio-temporal correlations of both aggregated and individual mobile data traffic, while only a few practical prediction techniques are proposed aiming at the latter’s prediction. Still, a large amount of the investigation on individual mobile traffic is needed. For instance, to perform data-driven prediction analysis on the theoretical and practical predictabilities of individual mobile data traffic.
Taxonomy of Prediction Techniques
In this thesis, we study the prediction of individual mobile data traffic by mining individual time series of locations and data sessions. For that, here we review the state-of-the-art techniques for the time series prediction. Some of them have been already used in the prediction of the Internet traffic.
Theoretical Predictability Measurements
The theoretical predictability of a time series describes its inherent forecasting capacity and is independent of any actual prediction technique. It evaluates to what degree the time se-ries can be foreseen and bounds to the maximum forecasting performance. The theoretical predictability is correlated with the uncertainty that is usually measured by the entropy in information theory [64]. Feder et al. [65] build the quantitive correlation between the pre-dictability and entropy, which is later applied on the study of an actual behavior (i.e., human mobility) by Song et al. [17]. They are then followed by similar predictability studies on human mobility [66, 67], vehicular traffic [68], radio spectrum state [69], and aggregated mobile data traffic [31]. Nevertheless, the theoretical predictability of per-user mobile data traffic remains untouched.
Analyzing the theoretical predictability helps the design and implementation of practical prediction techniques e.g., human mobility prediction [66]. In this thesis, we study the theoreti-cal predictability of mobile data traffic on the foundation of Feder et al. [65]. Yet, the practical predictability of time series still depends on the prediction techniques, e.g., probabilistic or regression models. We summarize these practical techniques in the following.
ARIMA Prediction Models
This class contains the typical ARIMA, AR, MA, and ARMA models. They are designed for the prediction of time series of continuous scalar values and leverage moving average and autoregressive filtering. They often appear in the time series analyses of statistics and economic studies [70]. The application of the models assumes that the target time series or its difference is stationary [70]. Besides the typical models, the others in this class, such as MEAN, LAST, BM, FARIMA, and GARCH, are used to predict network resources in wired networks [71, 72, 73, 74] with smoothing solutions (e.g., wavelet approximation) [43]. Nevertheless, the designs of the typical models cannot combine discrete values (e.g., visited cell tower locations) in their prediction, and none of these works has considered the spatial dynamics of their targets.
Markovian Prediction Methods
The methods in this class, i.e., MC, PPM, SPM, and ALZ, leverage the Markov property. They are mainly designed for the prediction of time series of discrete observations. Their application assumes that the target time series has the Markov property, i.e., its current value is always de-termined by a limited number of its previous values. In this class, a prediction method predicts the current value Xt of a time series by building a probabilistic model from its full history and solving the following maximization problem: x^t = arg max x P (Xt = xjxt 1; ; xt k) where xt 1; ; xt k are the newest k previous values observed in the time series.
MC (Markov Chain) This is almost the simplest Markovian method [75]. A k-th order Markov chain, represented as MC(k), makes a prediction of the state Xt solely based on the fixed previous k states. It builds a transition matrix consisting of the probabilities of transitions from the past k states to the current one. There are several common practices to compute the probabilities, such as MLE (maximum likelihood estimation) [76] and MCMC (Markov Chain Monte Charlo) [77]. However, it needs a large number of samples to compute probabilities, which grows quickly with respect to the order k and the alphabet of discrete values.
PPM (Prediction by Partial Matching) This is an improved method of the Markov chain, used massively in the lossless text compression [78]. A k-th order PPM model, repre-sented as PPM(k), is a combination of MC(m); 8m k. It computes the so-called escape probabilities of the state Xt as the weighted sums of the probabilities of all the Markov models in the combination. We use the implementation of Moffat et al. [78, Method C] in this thesis.
SPM (Sampled Pattern Matching) This is another improved method of the Markov chain designed by Jacquet et al. [79]; for predicting the current state Xt, it considers much larger immediately preceding states than the MC or PPM does. In a SPM predictor, instead of using a fixed order k, the considered length of immediately preceding context is determined as a fixed fraction (represented as the parameter ) of the longest context which has previously appeared. An SPM model with the parameter is represented as SPM( ).
ALZ (Active LeZi) This is an improved online prediction algorithm based on the classical LZ78 data compression scheme proposed by Gopalratnam et al. [80]. It also employs the power of the Markov property and is able to incrementally learn the sequence and to deliver real time predictions. The ALZ algorithm also makes a prediction of the state Xt in the time series given the preceding context, while a variable window of immediately preceding symbols is maintained, of which the length is the longest phrase previously observed in a classical LZ78 parsing. With this window, the algorithm can compute statistics on all possible preceding contexts. For the pseudo code of this algorithm, we refer the reader to [80, Figure 3].
Supervised Machine Learning Methods
This class contains a group of state-of-the-art techniques that are categorized to the field of supervised machine learning in practice [81]. These techniques can solve problems in the shape of y = f(x) where x and y are the input and output vectors. Each of them builds a model (which is composed of kernel functions, decision/regression tress, or neuron networks) upon a training set that consists of known instances of x and its corresponding y as the output classes (in a classification problem) or values (a regression problem). Then, the trained model can predict y from x in a new instance. They are capable of forecasting time series of continuous and discrete values.
SVM (Support Vector Machine) This is actually a classification algorithm [75]. It assigns a new input vector x to one of the already known output classes y, just as other classification algorithms do. SVM is adaptive to both linear and non-linear discriminant functions during the training process. It applies a kernel function to map the training data to a high-dimensional feature space, which ensures the ability to produce a more accurate classification. Further, SVM is insensitive to the number of input features.
GBRT (Gradient Boosted Regression Trees) This algorithm is on the basis of the en-semble learning technique [82]. It build a prediction model as an ensemble of weak prediction models, i.e., regression trees, in a state-wise fashion and allows optimization of an arbitrary differentiable loss function. The GBRT algorithm has advantages in terms of flexibility for heterogeneous features, good predictive power, and training speed, and has well-tested imple-mentations.
MLP (Multilayer Perceptron) This algorithm is a typical supervised learning algorithm using artificial neural networks. It is designed for both regression and classification problems. A MLP network is a feedforward artificial neural network that is fully connected. The MLP algorithm accepts different activation functions, layers and neurons [83].
To the best of our knowledge, there is still no study about these prediction techniques’ per-formance on per-user mobile data traffic prediction. We are not able to judge which is superior so far, and thus, will perform a comparative study in this thesis. First, ARIMA models are not considered because they do not support discrete values, which is necessary for our analy-sis. Second, all Markovian methods will be utilized, since they show impressive performance on similar time series prediction scenarios, such as human mobility prediction [76] and aggre-gated mobile data traffic prediction [33]. Besides, supervised machine learning techniques are shown to be effective in multiple application scenarios [75, 83], including aggregated Internet traffic [84]. We will also consider them in our analysis.
Operator-collected Mobility Data Utilization
A large amount of operator-collected datasets has been utilized in the literature [23], due to the spreading of mobile devices and the enlarging openness of mobile network operators. In the wide scope of research disciplines driven by mining the operator-collected datasets, the analytic data investigation of human mobility is one of the most significant topics [23]. In this context, the mobile network works like a gigantic and multi-modal sensing platform that produces time-stamped human footprints. In this section, we introduce the background on the utilization of the operator-collected datasets. Our focus is on the state-of-the-art means of processing human footprints with the limited spatiotemporal granularity [23] in order to provide an introductory guide.
The Events that Produce Human Footprints
Human footprints in most of the operator-collected datasets come from CDR – once referred to Call Detail Records and later to Charging Data Records in the 3GPP specification [85, TS 32.298]. The collection of CDR is triggered by the so-called charging events [85, TS 32.250/32.251]. In the following, we list the types of charging events that lead to human footprints.
Voice call. This is the most common trigger to the collection of operator-collected datasets in the literature [23]. Each voice call CDR marks (i) when the call originates and terminates and (ii) only the cell tower where the call originates. Significant effort has been put on the characterization of voice calls [46, 86, 87, 88, 89, 90, 91]; it is well known that the temporal heterogeneity and sparsity of voice calls limit the quality of mobility information and cause a heavy data loss of locations.
Text message. This is also a common trigger and produces text message CDR that are similar to voice call CDR. Yet, each text message triggers two CDR separately, which correspond to the origination and delivery of the message and mark the cell towers from/to which the message is created/delivered. Usually, text message CDR appear together with voice call CDR in the operator-collected datasets [23]. According to the characterization [46], text messages also only provide limited mobility information due to their temporal heterogeneity and sparsity.
Internet visit. This event triggers multiple CDR types such as S-CDR, SGW-CDR, and PGW-CDR [85, TS 32.251]. Only the S-CDR has a mandatory parameter describing the location of devices; it points to the cell tower where a mobile device originated a session for Internet visits. This event appears as a new source of human footprints in recent years, and is used in several studies [27, 40, 46, 92, 93, 94, 95]. Internet visits provide far more human footprints than voice calls or text messages; they still suffer the temporal heterogeneity but are far less sparse than the other two events [52]. Therefore, a operator-collected dataset of Internet visits usually appears as the ground-truth data as in [46, 96].
Mobility update. Every time a mobile device enters a roaming process or has an inter-system handover, a CDR is triggered and contains the identifier of the new cell tower. Several operator-collected datasets contain human footprints collected by this events along with Internet visits, including [96, 97, 98].
Location request. The 3GPP specification [85, TS 23.271] describes the capability for both the network and mobile devices to initiate location requests. Once a location request is created, the network use its built-in LCS (location services) method [99] to measure the position of the target mobile device, and two CDR are recorded: one for the origination of the location request and the other for the termination [85, TS 32.251]. However, to the best of our knowledge, none of the datasets collects the location request CDR.
Others. Events such as power on/off and network switch (LTE to 2G/3G) may also trigger the generation of CDR of their own types and help the collection of human footprints, while they rarely appear in the literature: we only find one dataset [96] uses these events together with Internet visits.
Consequently, voice calls and text messages are still the major contributors to human foot-prints in the literature heretofore [23], while the mobility information among them has limited temporal granularity due to their high temporal heterogeneity and sparsity. Nevertheless, due to the fact that there are more geo-referenced CDR types while having more sensitive user behavior data, we may have mobile network operators with better openness and operator-collected datasets with a higher temporal resolution in future.
Extracting Geographical Coordinates from CDR
The aforementioned charging events help the collection of human footprints in the literature. Particularly, in the CDR of these charging events, there are specific parameters which provide those human footprints, named and defined in the 3GPP specification [85, TS 29.002]. With respect to the provided information, we categorize these parameters into three classes, i.e., who (Subscriber Equipment Number, Served IMEI, or Served IMSI), when (Event time stamps), and where (Cell Identifier or Location Estimate) [85, TS 29.002].
The where parameters need some explanation. The cell identifier parameter is the digit identifier of the serving sector, antenna, or cell tower that a mobile device communicates with.
It is included as a mandatory parameter in the CDR of all the charging events mentioned above, while it is not a geographical coordinate. The location estimate parameter represents the geographical coordinate of a mobile device, but is only included in the CDR of a location request which is not a common trigger of the collection of human footprints. Therefore, in most of the operator-collected datasets, the dataset users or the data providers have to convert those cell identifiers to actual geographical coordinates. In the following, we summarize the major means of the conversion.
Table of contents :
1 Introduction
1.1 Predicting Per-user Mobile Data Traffic
1.2 Utilizing Operator-collected Datasets
1.3 Contributions and Thesis Outline
2 Background
2.1 Mobile Data Traffic Prediction
2.2 Operator-collected Mobility Data Utilization
2.3 Summary
3 Datasets: Characteristics and Challenges
3.1 Human Behavior Collection
3.2 Operator-collected Large-scale Datasets
3.3 Application-based Mobility Datasets
3.4 Challenge of Completeness
3.5 Challenge of Mobility Measurement
3.6 Challenge of Data Processing
3.7 Summary
4 CDR-based Trajectory Completion
4.1 Terminology
4.2 Completing Instant CDR-based Trajectories
4.3 Completing Slotted CDR-based Trajectories
4.4 Summary
5 Per-User Mobile Data Traffic Prediction
5.1 Terminology and Definitions
5.2 Characterizing Individual Mobile Data Traffic
5.3 Constructing Per-user Spatiotemporal Behavioral Data
5.4 Investigation through Temporal Dynamics
5.5 Investigation through Spatiotemporal Dynamics
5.6 Additional Investigation of Human Mobility
5.7 Summary
6 Conclusion and Outlook
6.1 Summary of the Thesis
6.2 Limitation and Outlook
6.3 Concluding Remarks
Bibliography