Predicting Per-user Mobile Data Traffic – Project topics materials

Get Complete Project Material File(s) Now! »

Mobile Data Traﬃc Prediction

To what degree is the Internet traﬃc predictable? It is a question that has led to a number of attractive issues and has been continuously investigated since the invention of the Internet [50]. In this section, we review the state-of-the-art on the prediction of mobile data traﬃc. Our discussion is organized from two perspectives:
Aggregated mobile data traﬃc. In this perspective, we consider the mobile data traﬃc from the viewpoint of a mobile network operator. Such data traﬃc is aggregated over many mobile devices within the same cell, the same close geographical area, or the same service/application.
Individual mobile data traﬃc. Here we discuss in an individual viewpoint, i.e., the mobile data traﬃc that is generated by a single mobile device.
For each perspective, we briefly introduce the data traﬃc characterization and particularly present the practical prediction techniques. It is worth noting that, in this section, we focus on the studies on the Internet traﬃc and exclude those on other traﬃc (e.g., voice calls).

Literature on Aggregated Mobile Data Traﬃc

The investigation on aggregated mobile data traﬃc is mainly driven by the analyses on world-wide large-scale operator-collected datasets. For instance, such datasets that have nationwide populations are deeply mined in the relevant studies by Paul et al. [27] (USA), Hoteit et al. [51] (France), and Xu et al. [37, 38] (China).

Characterization

There are two major aspects with respect to the characterization, i.e., temporal dynamics and spatiotemporal correlation.
There is a general agreement on the regularity of the temporal variation of aggregated mobile data traﬃc [23]. Almost at the same time, Paul et al. [27] and Shafiq et al. [33] separately investigate the temporal evolution of aggregated mobile data traﬃc of cell towers and popular applications. They both find that such traﬃc follows a daily repetitive pattern over weekdays: in general, the traﬃc has low demand during nighttime and high demand during daytime. The same repetitive pattern is also observed by Xu et al. [37, 38]. Is is also remarked in [27, 33, 52] that the traﬃc over weekdays and weekends have diﬀerent repetitive patterns and demands; a larger data traﬃc demand exists on weekdays than weekends. An interesting fact is that the temporal variations observed by Paul et al. [27] and Shafiq et al. [33] have diﬀerent peak hours, which is also observed from other network traﬃc [23]. For this, a possible explanation is that such temporal variation under a higher temporal resolution partially depends on the area of study.
The spatiotemporal correlation exists among the data traﬃc generated by cell towers over many users in the same area. In the pure spatial perspective, the distribution of the data traﬃc is spatially heterogeneous: it varies over diﬀerent regions as revealed by Paul et al. [27] and Xu et al. [37, 38]. Further, the latter authors find that the cell towers have similar data traﬃc profiles regarding their regions (i.e., resident, transport, oﬃce, and entertainment) and such profiles of adjacent cell towers are correlated. In the spatiotemporal perspective, the two papers show that the spatial heterogeneity above also varies over time: pick hours depend on regions. The former authors leverage a quantitive measure (i.e., the Moran’s I statistic) to evaluate spatiotemporal diversity of the data traﬃc. They find that in general, the imminent loads of adjacent cell towers are more correlated when these loads are high, but the correlation is relatively weak and almost disappears around midnights. Recently, [53] further investigates the spatiotemporal correlation and propose an approach to infer the hidden spatial and temporal structures of aggregated mobile data traﬃc.
Also, several studies reveal the spatial heterogeneity aggregated over applications. The earlier work by Trestian et al. [40] already shows that the Internet traﬃc over services and applications is consumed diﬀerently at home and work locations. Hoteit et al. [51] find that the data traﬃc loads of cell towers have diﬀerent inner diversities among TCP- and UDP-based services. Later, the extended analysis by Shafiq et al. [39] finds that the data traﬃc aggregated by popular applications is strongly heterogenous over regions. This provides the capability of categorizing cell towers into four classes (web browsing, email, audio, and mixed traﬃc) with respect to the major applications in their data traﬃc loads.

Prediction

Some eﬀorts have been put on the prediction of aggregated mobile data traﬃc. They aim at converting the observed dynamics and correlations above to practical prediction techniques. In the following, we review the proposed prediction techniques according to the level of the aggregation. Cell-level data traﬃc. There is a common observation on the fact that the data traﬃc of cell towers has a high degree of both theoretical and practical predictabilities. Regarding the theoretical viewpoint, Zhang et al. [31, 32] investigate the limits of the theoretical predictability by observing the traﬃc of 7; 000 cell towers in China. They find that under the temporal resolution of 30 minutes, aggregated traﬃc (voice, text, and data) can be well predicted from the historical demand of the preceding 15 hours; the theoretical predictability of the data traﬃc is lower than that of the data flow of voice calls or text messages. They also find that the knowledge of the traﬃc demands of adjacent cells towers can enhance the theoretical predictability, but in a less degree on the data traﬃc than the others, which supports the quantitative evaluation on the spatiotemporal correlation by Paul et al. [27]. Their results ensure the capability of time series prediction techniques on the prediction of such traﬃc.
Regarding the practical prediction techniques, Xu et al. [37, 38] show that the cell-level data traﬃc is predictable via a linear combination of four primary components corresponding to human activities. Zang et al. [54] propose a mixed machine learning approach composed of K-means clustering, Elman Neural Network, and wavelet decom-position. An alternative prediction approach is proposed by Yi et al. [55]; it builds a complex network among cell towers, measures the traﬃc on the very important ones, and predicts the others’ traﬃc using Support Vector Regression – another machine learn-ing method. It can recover the whole picture of the traﬃc demand from only 8% of the total cell towers. In the opposite viewpoint, Nika et al. [36] perform an empirical study on data hotspots using a large-scale operator-collected dataset of 5; 327 cell towers, and show the availability of standard machine learning methods on the prediction of future hotspots (cells towers) of the traﬃc demand from the past history.
Application-level data traﬃc. The early paper by Keralapura et al.[56] proposes a technique to cluster users and their browsing profiles. The authors find that user behav-ior in terms of Internet surfing can be captured using a small number of clusters. Such heterogeneity of aggregated mobile data traﬃc is also investigated by Ying et al. [57]. Later, Shafiq et al. [33] uses a Zipf-like model to capture the distribution of application-level mobile data traﬃc and finds that the regularity makes the temporal variation of the traﬃc highly predictable from the history of the past demand using a simple Markovian method. Recently, Zhang et al. [58] design a mixed application-level traﬃc prediction framework that leverages the -stable modeled property and dictionary learning to sep-arately deal with the temporal variation and the spatial sparsity of the traﬃc. Marquez et al. [59] extend the analysis in [33] and reveal a strong heterogeneity in diﬀerence mobile service demands using correlation and clustering. They show that the temporal usage patterns are quite diﬀerent from service to service. Besides, several works focus on the traﬃc generated by special services, such as chatting (e.g., WhatsApp [60] and WeChat [61]), video streaming [62], and mobile cloud [63].
In summary, the proposed techniques extend the technical bound on the prediction of mobile data traﬃc: they not only leverage the legacy tools that used for analyzing wired network traﬃc (e.g., the entropy, Markov property, -stable modeled property) to capture the temporal variation but also import several state-of-the-art machine learning tools to utilize the spatiotemporal correlation.

Literature on Individual Mobile Data Traﬃc

A relatively small body of literature is on the investigation of individual mobile data traﬃc, which is also driven by data mining. Diﬀerently, the relevant studies utilize both large-scale operator-collected datasets, e.g., by Paul et al. [27] and Oliveira et al. [34, 52], and small-scale mobile crowdsensing datasets, e.g., by Jo et al. [35].

Characterization

The characterization from the individual viewpoint is performed by Paul et al. [27], Jo et al. [35], Li et al. [42], Oliveira et al. [34, 52], among others.
There is a general agreement on the heterogeneity of the data traﬃc, with respect to the user population and the time. It is shared by Paul et al. [27] and Oliveira et al. [34, 52]. They show that most of the total data traﬃc is generated from a small group of « heavy » users.
Regarding the temporal variation, both the authors above find that, in general, each user is highly active only in a few hours per day, and similarly, the temporal variation is diﬀerent on weekdays and weekends, as in aggregate mobile data traﬃc. The latter authors [34, 52] find that individual mobile data traﬃc also follows daily repetitive patterns and the users also have peak and non-peak hours in terms of data traﬃc. In particular, they find that the variation of diﬀerent hours within the same day is stronger than that of the same hours overs diﬀerent days.
As to the spatiotemporal correlation, Paul et al. [27] point out that a user is usually active at only a few of his common locations. Jo et al. [35] mine a small dataset of locations and services of 124 users over 16 months and they identify the spatiotemporal correlations of service usage patterns.
Other dynamics with respect to social features are also revealed. For instance, Oliveira et al. [34, 52] find that the distribution of individual mobile data traﬃc is slightly heterogeneous over the age and gender; Li et al. [42] focus on the major smartphone operating systems and discuss the traﬃc dynamics and major application in each system.

Prediction

Yet, fewer studies have addressed the prediction of individual mobile data traﬃc. Regarding the bandwidth of mobile devices around a cell tower, a theoretical analysis is performed by Bui et al. [43, 44]. Based on a theoretical LTE radio model, the authors propose a model to predict the bandwidth of mobile devices over a wide range of time scales [43]. Their model considers both the user location and the statistic of bandwidth availability. They also design a refined model aiming at the prediction of short-term bandwidth using Gaussian Random Walks [44]. Regarding the latency of each data session, Zhao et al. [11, 45] address the static and dynamic latency estimation problems and propose a distance-feature decomposition algorithm based on the Matrix Factorization technique to predict the latency.
In summary, the current literature has already shown the temporal dynamics and spatio-temporal correlations of both aggregated and individual mobile data traﬃc, while only a few practical prediction techniques are proposed aiming at the latter’s prediction. Still, a large amount of the investigation on individual mobile traﬃc is needed. For instance, to perform data-driven prediction analysis on the theoretical and practical predictabilities of individual mobile data traﬃc.

Taxonomy of Prediction Techniques

In this thesis, we study the prediction of individual mobile data traﬃc by mining individual time series of locations and data sessions. For that, here we review the state-of-the-art techniques for the time series prediction. Some of them have been already used in the prediction of the Internet traﬃc.

Theoretical Predictability Measurements

The theoretical predictability of a time series describes its inherent forecasting capacity and is independent of any actual prediction technique. It evaluates to what degree the time se-ries can be foreseen and bounds to the maximum forecasting performance. The theoretical predictability is correlated with the uncertainty that is usually measured by the entropy in information theory [64]. Feder et al. [65] build the quantitive correlation between the pre-dictability and entropy, which is later applied on the study of an actual behavior (i.e., human mobility) by Song et al. [17]. They are then followed by similar predictability studies on human mobility [66, 67], vehicular traﬃc [68], radio spectrum state [69], and aggregated mobile data traﬃc [31]. Nevertheless, the theoretical predictability of per-user mobile data traﬃc remains untouched.
Analyzing the theoretical predictability helps the design and implementation of practical prediction techniques e.g., human mobility prediction [66]. In this thesis, we study the theoreti-cal predictability of mobile data traﬃc on the foundation of Feder et al. [65]. Yet, the practical predictability of time series still depends on the prediction techniques, e.g., probabilistic or regression models. We summarize these practical techniques in the following.

READ 3D PRINTING OF MAGNETO-RESPONSIVE POLYMERS WITH RANDOMLY DISPERSED MAGNETIC NANOPARTICLES

ARIMA Prediction Models

This class contains the typical ARIMA, AR, MA, and ARMA models. They are designed for the prediction of time series of continuous scalar values and leverage moving average and autoregressive filtering. They often appear in the time series analyses of statistics and economic studies [70]. The application of the models assumes that the target time series or its diﬀerence is stationary [70]. Besides the typical models, the others in this class, such as MEAN, LAST, BM, FARIMA, and GARCH, are used to predict network resources in wired networks [71, 72, 73, 74] with smoothing solutions (e.g., wavelet approximation) [43]. Nevertheless, the designs of the typical models cannot combine discrete values (e.g., visited cell tower locations) in their prediction, and none of these works has considered the spatial dynamics of their targets.

Markovian Prediction Methods

The methods in this class, i.e., MC, PPM, SPM, and ALZ, leverage the Markov property. They are mainly designed for the prediction of time series of discrete observations. Their application assumes that the target time series has the Markov property, i.e., its current value is always de-termined by a limited number of its previous values. In this class, a prediction method predicts the current value Xt of a time series by building a probabilistic model from its full history and solving the following maximization problem: x^t = arg max x P (Xt = xjxt 1; ; xt k) where xt 1; ; xt k are the newest k previous values observed in the time series.
MC (Markov Chain) This is almost the simplest Markovian method [75]. A k-th order Markov chain, represented as MC(k), makes a prediction of the state Xt solely based on the fixed previous k states. It builds a transition matrix consisting of the probabilities of transitions from the past k states to the current one. There are several common practices to compute the probabilities, such as MLE (maximum likelihood estimation) [76] and MCMC (Markov Chain Monte Charlo) [77]. However, it needs a large number of samples to compute probabilities, which grows quickly with respect to the order k and the alphabet of discrete values.
PPM (Prediction by Partial Matching) This is an improved method of the Markov chain, used massively in the lossless text compression [78]. A k-th order PPM model, repre-sented as PPM(k), is a combination of MC(m); 8m k. It computes the so-called escape probabilities of the state Xt as the weighted sums of the probabilities of all the Markov models in the combination. We use the implementation of Moﬀat et al. [78, Method C] in this thesis.
SPM (Sampled Pattern Matching) This is another improved method of the Markov chain designed by Jacquet et al. [79]; for predicting the current state Xt, it considers much larger immediately preceding states than the MC or PPM does. In a SPM predictor, instead of using a fixed order k, the considered length of immediately preceding context is determined as a fixed fraction (represented as the parameter ) of the longest context which has previously appeared. An SPM model with the parameter is represented as SPM( ).
ALZ (Active LeZi) This is an improved online prediction algorithm based on the classical LZ78 data compression scheme proposed by Gopalratnam et al. [80]. It also employs the power of the Markov property and is able to incrementally learn the sequence and to deliver real time predictions. The ALZ algorithm also makes a prediction of the state Xt in the time series given the preceding context, while a variable window of immediately preceding symbols is maintained, of which the length is the longest phrase previously observed in a classical LZ78 parsing. With this window, the algorithm can compute statistics on all possible preceding contexts. For the pseudo code of this algorithm, we refer the reader to [80, Figure 3].

Supervised Machine Learning Methods

This class contains a group of state-of-the-art techniques that are categorized to the field of supervised machine learning in practice [81]. These techniques can solve problems in the shape of y = f(x) where x and y are the input and output vectors. Each of them builds a model (which is composed of kernel functions, decision/regression tress, or neuron networks) upon a training set that consists of known instances of x and its corresponding y as the output classes (in a classification problem) or values (a regression problem). Then, the trained model can predict y from x in a new instance. They are capable of forecasting time series of continuous and discrete values.
SVM (Support Vector Machine) This is actually a classification algorithm [75]. It assigns a new input vector x to one of the already known output classes y, just as other classification algorithms do. SVM is adaptive to both linear and non-linear discriminant functions during the training process. It applies a kernel function to map the training data to a high-dimensional feature space, which ensures the ability to produce a more accurate classification. Further, SVM is insensitive to the number of input features.
GBRT (Gradient Boosted Regression Trees) This algorithm is on the basis of the en-semble learning technique [82]. It build a prediction model as an ensemble of weak prediction models, i.e., regression trees, in a state-wise fashion and allows optimization of an arbitrary diﬀerentiable loss function. The GBRT algorithm has advantages in terms of flexibility for heterogeneous features, good predictive power, and training speed, and has well-tested imple-mentations.
MLP (Multilayer Perceptron) This algorithm is a typical supervised learning algorithm using artificial neural networks. It is designed for both regression and classification problems. A MLP network is a feedforward artificial neural network that is fully connected. The MLP algorithm accepts diﬀerent activation functions, layers and neurons [83].
To the best of our knowledge, there is still no study about these prediction techniques’ per-formance on per-user mobile data traﬃc prediction. We are not able to judge which is superior so far, and thus, will perform a comparative study in this thesis. First, ARIMA models are not considered because they do not support discrete values, which is necessary for our analy-sis. Second, all Markovian methods will be utilized, since they show impressive performance on similar time series prediction scenarios, such as human mobility prediction [76] and aggre-gated mobile data traﬃc prediction [33]. Besides, supervised machine learning techniques are shown to be eﬀective in multiple application scenarios [75, 83], including aggregated Internet traﬃc [84]. We will also consider them in our analysis.

Operator-collected Mobility Data Utilization

A large amount of operator-collected datasets has been utilized in the literature [23], due to the spreading of mobile devices and the enlarging openness of mobile network operators. In the wide scope of research disciplines driven by mining the operator-collected datasets, the analytic data investigation of human mobility is one of the most significant topics [23]. In this context, the mobile network works like a gigantic and multi-modal sensing platform that produces time-stamped human footprints. In this section, we introduce the background on the utilization of the operator-collected datasets. Our focus is on the state-of-the-art means of processing human footprints with the limited spatiotemporal granularity [23] in order to provide an introductory guide.

The Events that Produce Human Footprints

Human footprints in most of the operator-collected datasets come from CDR – once referred to Call Detail Records and later to Charging Data Records in the 3GPP specification [85, TS 32.298]. The collection of CDR is triggered by the so-called charging events [85, TS 32.250/32.251]. In the following, we list the types of charging events that lead to human footprints.
Voice call. This is the most common trigger to the collection of operator-collected datasets in the literature [23]. Each voice call CDR marks (i) when the call originates and terminates and (ii) only the cell tower where the call originates. Significant eﬀort has been put on the characterization of voice calls [46, 86, 87, 88, 89, 90, 91]; it is well known that the temporal heterogeneity and sparsity of voice calls limit the quality of mobility information and cause a heavy data loss of locations.
Text message. This is also a common trigger and produces text message CDR that are similar to voice call CDR. Yet, each text message triggers two CDR separately, which correspond to the origination and delivery of the message and mark the cell towers from/to which the message is created/delivered. Usually, text message CDR appear together with voice call CDR in the operator-collected datasets [23]. According to the characterization [46], text messages also only provide limited mobility information due to their temporal heterogeneity and sparsity.
Internet visit. This event triggers multiple CDR types such as S-CDR, SGW-CDR, and PGW-CDR [85, TS 32.251]. Only the S-CDR has a mandatory parameter describing the location of devices; it points to the cell tower where a mobile device originated a session for Internet visits. This event appears as a new source of human footprints in recent years, and is used in several studies [27, 40, 46, 92, 93, 94, 95]. Internet visits provide far more human footprints than voice calls or text messages; they still suﬀer the temporal heterogeneity but are far less sparse than the other two events [52]. Therefore, a operator-collected dataset of Internet visits usually appears as the ground-truth data as in [46, 96].
Mobility update. Every time a mobile device enters a roaming process or has an inter-system handover, a CDR is triggered and contains the identifier of the new cell tower. Several operator-collected datasets contain human footprints collected by this events along with Internet visits, including [96, 97, 98].
Location request. The 3GPP specification [85, TS 23.271] describes the capability for both the network and mobile devices to initiate location requests. Once a location request is created, the network use its built-in LCS (location services) method [99] to measure the position of the target mobile device, and two CDR are recorded: one for the origination of the location request and the other for the termination [85, TS 32.251]. However, to the best of our knowledge, none of the datasets collects the location request CDR.
Others. Events such as power on/oﬀ and network switch (LTE to 2G/3G) may also trigger the generation of CDR of their own types and help the collection of human footprints, while they rarely appear in the literature: we only find one dataset [96] uses these events together with Internet visits.
Consequently, voice calls and text messages are still the major contributors to human foot-prints in the literature heretofore [23], while the mobility information among them has limited temporal granularity due to their high temporal heterogeneity and sparsity. Nevertheless, due to the fact that there are more geo-referenced CDR types while having more sensitive user behavior data, we may have mobile network operators with better openness and operator-collected datasets with a higher temporal resolution in future.

Extracting Geographical Coordinates from CDR

The aforementioned charging events help the collection of human footprints in the literature. Particularly, in the CDR of these charging events, there are specific parameters which provide those human footprints, named and defined in the 3GPP specification [85, TS 29.002]. With respect to the provided information, we categorize these parameters into three classes, i.e., who (Subscriber Equipment Number, Served IMEI, or Served IMSI), when (Event time stamps), and where (Cell Identifier or Location Estimate) [85, TS 29.002].
The where parameters need some explanation. The cell identifier parameter is the digit identifier of the serving sector, antenna, or cell tower that a mobile device communicates with.
It is included as a mandatory parameter in the CDR of all the charging events mentioned above, while it is not a geographical coordinate. The location estimate parameter represents the geographical coordinate of a mobile device, but is only included in the CDR of a location request which is not a common trigger of the collection of human footprints. Therefore, in most of the operator-collected datasets, the dataset users or the data providers have to convert those cell identifiers to actual geographical coordinates. In the following, we summarize the major means of the conversion.

Table of contents :

1 Introduction
1.1 Predicting Per-user Mobile Data Traffic
1.2 Utilizing Operator-collected Datasets
1.3 Contributions and Thesis Outline
2 Background
2.1 Mobile Data Traffic Prediction
2.2 Operator-collected Mobility Data Utilization
2.3 Summary
3 Datasets: Characteristics and Challenges
3.1 Human Behavior Collection
3.2 Operator-collected Large-scale Datasets
3.3 Application-based Mobility Datasets
3.4 Challenge of Completeness
3.5 Challenge of Mobility Measurement
3.6 Challenge of Data Processing
3.7 Summary
4 CDR-based Trajectory Completion
4.1 Terminology
4.2 Completing Instant CDR-based Trajectories
4.3 Completing Slotted CDR-based Trajectories
4.4 Summary
5 Per-User Mobile Data Traffic Prediction
5.1 Terminology and Definitions
5.2 Characterizing Individual Mobile Data Traffic
5.3 Constructing Per-user Spatiotemporal Behavioral Data
5.4 Investigation through Temporal Dynamics
5.5 Investigation through Spatiotemporal Dynamics
5.6 Additional Investigation of Human Mobility
5.7 Summary
6 Conclusion and Outlook
6.1 Summary of the Thesis
6.2 Limitation and Outlook
6.3 Concluding Remarks
Bibliography