THESIS TOPIC: OPTIMIZATION OF 360-DEGREE VIDEO TRANSMISSION
Full name: Pham Ngoc Son
Grade: 2021BKKTĐT ELITECH05
Student ID: 20212644M
• An accurate viewport prediction method for 360 video
INTRODUCTION
Virtual reality
Virtual reality (VR) is the use of computer technology to create a simulated environment that immerses the user in an experience. Unlike conventional user interfaces, VR systems recreate and change the 360-degree environment in real time to suit the situation, and users interact with the 3D world through a recreation of as many senses as possible, including smell, touch, hearing, and sight [1]. Humans make decisions through their actions (or thoughts), and sensors collect, analyze, and transform these signals into the proper content using computers. VR systems give users immersive experiences in a realistic virtual world due to the following factors:
- Real-time interactivity: When users interact with the virtual world, it changes correspondingly with only a short delay, allowing them to see their actions reflected in real time
- Feeling of immersion: Full immersion is a sensory experience that feels so real that users forget it is a virtual environment and start interacting with it as if it were the real world. In a virtual reality environment, a fully synthetic world may or may not mimic the properties of a real-world environment. This means that virtual reality environments can simulate everyday situations, such as walking on the streets of Hanoi, or they can exceed the limits of physical reality by creating a world in which anything is possible
Virtual reality (VR) technology helps users experience things that are not always possible in real life. In addition to providing interesting experiences, VR is also applied in many different fields, such as science and technology, architecture, the military, entertainment, and tourism, to meet the needs of research, education, commerce, and services. Because of its unique characteristics, VR technology is well suited to the development of the entertainment industry. Traveling through a small screen is no longer boring when users can put on a VR headset and see firsthand the wonders of the world, present, past, and future, depending on the content. In addition to experiencing movies, 360-degree games, and social networks, browsing the web is also much more interesting in virtual reality. In the medical field, virtual reality technology is especially useful for treatment, research, and patient recovery thanks to its recent advances. Virtual reality technology is being widely applied in education around the world, including in Vietnam. By breaking down economic and geographical barriers, virtual reality technology allows people from all over the world to interact with each other easily, enhancing their English language skills and communication abilities. Along with other computer technologies such as augmented reality (AR) and mixed reality (MR), virtual reality plays a key role in cyber-physical systems, which are part of the Industrial Revolution 4.0 [2]
The distinctive feature of virtual reality compared to other technologies is that it gives users a sense of immersion and allows them to freely explore the entire 360-degree space around them. To create this experience, virtual reality content is also produced differently from regular photos, using a special device called a 360-degree camera [3]
There are many different kinds of 360-degree cameras available today, but they all share the same special construction, with at least two sensors and unique lenses to capture the greatest field of view. An example of a 360-degree camera image is shown in Figure 1.1. A VR image or video holding the entire surrounding space is produced by gathering all the sensor-generated images and then stitching them together using dedicated algorithms
One challenge of stitching together 360-degree images is that each sensor's image will always have distorted edges. This distortion can be lessened by using multiple sensors, but doing so is more expensive and requires more intricate image-matching algorithms
Virtual reality glasses are an intermediary device that helps users experience virtual reality technology. These specialized glasses are shaped like a flexible tube, with a screen on each side and a converging lens at the back. To create the illusion of being in a virtual world, virtual reality glasses are equipped with a sensor system that records the user's head movements and eye gaze and changes the virtual environment accordingly. Some glasses are also equipped with headphones and other features that allow users to interact with objects in the virtual environment. Virtual reality glasses can be divided into two main types:
- Mobile: These glasses require a smartphone to be used. They are affordable and easy to set up, but the smartphone battery may heat up and drain quickly, the graphics processing power may be limited, and the screen resolution may be relatively low. Google Cardboard and Samsung Gear VR are two examples of mobile virtual reality glasses. Figure 1.2 shows a Samsung Gear VR virtual reality headset
Figure 1.2 Samsung Gear VR Virtual Reality Glasses
Figure 1.3 HTC Vive Virtual Reality Glasses
- Tethered: This type of virtual reality headset connects directly to a computer. A display screen is located inside the headset, and a computer is used to process the graphics and sound. Tethered headsets offer a more authentic experience than mobile headsets, but they are also more expensive. Examples of tethered headsets include the HTC Vive and Oculus Rift. Figure 1.3 shows an illustration of a tethered virtual reality headset
We are aware that the 360-degree image of space will be processed to create virtual reality (VR) content. These VR materials will be archived under various projections so that they can be recreated in 360-degree space when viewed with VR goggles. These projections use methods similar [3] to those used when mapping the surface of the Earth. They include Equirectangular (ERP), Cubemap (CMP), Equal-area (EAP), Octahedron (OHP), viewport generation using rectilinear projection, Icosahedron (ISP), Craster parabolic projection for CPP-PSNR calculation, Truncated Square Pyramid (TSP), and Segmented Sphere Projection (SSP). The most popular VR projection is Equirectangular (ERP). We will use the 360Lib tool and the ERP projection; therefore, we will discuss the ERP projection in detail. Figure 1.4 shows the ERP projection that we will be using
Figure 1.4 ERP projections in VR [3]
First, the 3D shape represented by the projections is described using the 3D XYZ coordinate system shown in Figure 1.5 below. The X-axis points to the front of the sphere, the Y-axis points to the top of the sphere, and the Z-axis points to the right of the sphere, all starting from the center of the sphere
Figure 1.5 XYZ 3D coordinate system with A3 being the equator [3]
The OXYZ spherical coordinate system is depicted in Figure 1.5. Longitude and latitude represent any point on the sphere. Latitude θ is defined on [−π/2, π/2] and longitude φ is defined on [−π, π], as shown in the diagram above. The following formulas (1.1), (1.2), and (1.3) can be used to determine the coordinate values (X, Y, Z) of any location on the unit sphere:

X = cos(θ)·cos(φ) (1.1)
Y = sin(θ) (1.2)
Z = −cos(θ)·sin(φ) (1.3)

Conversely, the longitude-latitude values (φ, θ) can also be calculated from the coordinates (X, Y, Z) using formulas (1.4) and (1.5) below:

φ = tan⁻¹(−Z/X) (1.4)
θ = sin⁻¹(Y/√(X² + Y² + Z²)) (1.5)
Each 2D projection plane has a defined coordinate system in 2D
Some projection formats, such as ERP and EAP, have only one face, while others have multiple faces. Each face in a 2D projection plane, also known as the UV plane, is given a face index to create a 2D coordinate system. A UV plane is a 2D image sampled on a mesh. We consider the sampling point at position (m, n), where m and n are the column and row coordinates of the sampling location. An example of UV sampling coordinates in an ERP projection is shown in Figure 1.6, where the sampling points (m, n) are represented by the small orange circles
Figure 1.6 Sampling coordinates in plane (u,v) [3]
Assuming that W and H are the width and height of one face, respectively, we perform transformations from the ERP projection's spherical coordinate system (X, Y, Z) to its planar coordinate system (f, m, n), where f is the face index. Since the ERP projection has only one face, the face index is set to f = 0 by default
In the UV coordinate system, u and v range from 0 to 1. To convert from the 2D coordinate system to the 3D spherical system, we first determine (u, v) using formulas (1.6) and (1.7):

u = (m + 0.5)/W, 0 ≤ m < W (1.6)
v = (n + 0.5)/H, 0 ≤ n < H (1.7)
Then we convert (u, v) to longitude-latitude (φ, θ) using formulas (1.8) and (1.9) below:

φ = (u − 0.5)·2π (1.8)
θ = (0.5 − v)·π (1.9)
Finally, (X, Y, Z) can be calculated using formulas (1.1), (1.2), and (1.3)
Conversely, to convert from 3D coordinates to the 2D projection, we first calculate (φ, θ) using formulas (1.4) and (1.5). Then, we calculate (u, v) by inverting formulas (1.8) and (1.9). Finally, we calculate (m, n) by inverting formulas (1.6) and (1.7). The above details how to project from the spherical coordinate system to the ERP projection and vice versa
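As a sanity check, the two conversion directions can be sketched in a few lines of Python. This is an illustrative implementation assuming the standard 360Lib ERP conventions (unit sphere, a single face, sampling offset of 0.5), not the 360Lib code itself:

```python
import math

def erp_to_sphere(m, n, w, h):
    """Map ERP sampling coordinates (m, n) to a point (X, Y, Z) on the unit sphere."""
    u = (m + 0.5) / w                 # (1.6)
    v = (n + 0.5) / h                 # (1.7)
    phi = (u - 0.5) * 2 * math.pi     # (1.8) longitude
    theta = (0.5 - v) * math.pi       # (1.9) latitude
    x = math.cos(theta) * math.cos(phi)   # (1.1)
    y = math.sin(theta)                   # (1.2)
    z = -math.cos(theta) * math.sin(phi)  # (1.3)
    return x, y, z

def sphere_to_erp(x, y, z, w, h):
    """Map a 3D point on the unit sphere back to ERP sampling coordinates (m, n)."""
    phi = math.atan2(-z, x)                                   # (1.4)
    theta = math.asin(y / math.sqrt(x * x + y * y + z * z))   # (1.5)
    u = phi / (2 * math.pi) + 0.5     # invert (1.8)
    v = 0.5 - theta / math.pi         # invert (1.9)
    m = u * w - 0.5                   # invert (1.6)
    n = v * h - 0.5                   # invert (1.7)
    return m, n
```

A round trip through both functions returns the original (m, n), which confirms that the two coordinate systems are consistent inverses of each other.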
QoE evaluation methods
Quality of experience assessment methods are based on two types of quality measurements, subjective and objective, as analyzed below
Formerly, quality of service (QoS) was considered the measure of multimedia quality, but now the factor that service providers aim for is quality of experience (QoE). This is understandable because QoS focuses on describing the objective and technical criteria that the application needs to achieve, while QoE is a measure of users' satisfaction with the services they are using, based on subjective assessments
To measure QoE, people use a type of score called the mean opinion score (MOS)
To conduct an MOS measurement experiment, a diverse group of participants of all ages, genders, education levels, and regions is gathered. The larger the group, the more accurate the measurement will be, but at least 16 participants are required. Each participant performs the assessment independently, and the average score of all participants is taken as the final score. User scores can use many different scales, but the most common is the 1-5 scale of the Absolute Category Rating method [5], as shown in Table 1.1 below:
Table 1.1 MOS scale according to the Absolute Category Rating method
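The averaging procedure above is straightforward; a small illustration follows, in which the participant scores are made up for the example:

```python
def mean_opinion_score(ratings):
    """Average ACR ratings (integers on the 1-5 scale) from a panel of participants."""
    if len(ratings) < 16:
        raise ValueError("an MOS experiment requires at least 16 participants")
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ACR ratings must lie on the 1-5 scale")
    return sum(ratings) / len(ratings)

# Hypothetical ratings from a panel of 16 participants
scores = [4, 5, 3, 4, 4, 5, 4, 3, 4, 5, 4, 4, 3, 5, 4, 4]
print(round(mean_opinion_score(scores), 2))  # prints 4.06
```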
The mean opinion score (MOS) is a direct measure of user experience and is considered the most accurate. However, MOS scores have the following disadvantages:
- Even if the experimenter has carefully planned and prepared all factors, such as time, place, and experimental equipment, the results of the experiment can still be affected by subjective factors on the part of the user, such as psychological state, health status, attitude toward the experiment, and cultural background
- The MOS rating scale is discrete, with a fixed gap between the score levels. This can sometimes confuse the evaluator
- Conducting an MOS measurement experiment requires significant investment in terms of people, time, and expense, and it is a complex process. Therefore, when regular evaluation of quality parameters is required, the MOS measurement method is not ideal
The above reasons necessitate the development of alternative methods for evaluating quality without resorting to complex, expensive experiments such as MOS, while still producing quality assessment metrics that are comparable to MOS
Although we trust the MOS value the most to reflect the user experience, as analyzed in section 1.3.1, MOS experiments are difficult to perform regularly, time-consuming, and costly. Thus, objective quality measurements were developed. These measurements are computed on a computer using well-studied formulas
We will look at some common objective quality measurements: Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity (SSIM), Feature SIMilarity (FSIM), Information content Weighted PSNR (IWPSNR), Information content Weighted SSIM (IWSSIM), Multi-Scale SIMilarity (MSSSIM), Riesz-transform-based Feature SIMilarity (RFSIM), and Universal image Quality Index (UQI), shown in Table 1.2 below:
Table 1.2 Metrics to measure video quality
PSNR: calculated from the pixels of the image, with all pixels weighted equally
SSIM [6]: Structural SIMilarity, calculated based on the concept of structural similarity
FSIM [7]: Feature SIMilarity, which combines low-level feature weighting with local similarity measures
IWPSNR [8]: Information content Weighted PSNR, a weighted combination applied to PSNR
MSSSIM [9]: Multi-Scale SSIM, calculated from similarity measures computed at multiple resolutions (scales) of an image
IWSSIM [10]: Information content Weighted SSIM, a weighted combination applied to SSIM
RFSIM: Riesz-transform-based Feature SIMilarity, combining low-level feature weighting based on Riesz transforms with local similarity measures
UQI: Universal image Quality Index, a combination of distortions (loss of correlation, luminance distortion, contrast distortion) over windows in the image
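Of the metrics above, PSNR is the simplest to compute. The following is a minimal NumPy sketch (the images here are synthetic test data, not frames from the thesis experiments):

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two images; every pixel is weighted equally."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Synthetic 8-bit image plus mild uniform noise in [-5, 5]
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(ref.astype(np.int16) + rng.integers(-5, 6, size=ref.shape),
                0, 255).astype(np.uint8)
print(psnr(ref, noisy))  # mild noise keeps PSNR well above 30 dB
```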
AN ACCURATE VIEWPORT ESTIMATION METHOD FOR 360-DEGREE VIDEO
Problem statement
For virtual reality (VR) content, users require 4K (3840x1920), 8K (8192x4096), or higher resolution and real-time interactivity (i.e., fast transmission time and low latency) to achieve a pleasant and immersive experience. Therefore, it is necessary to reduce the size of VR content as much as possible without sacrificing user experience. Since users only see a portion of the entire VR environment at a time, known as the viewport, the remaining content can be reduced in quality or removed altogether. Additionally, research is being conducted to reduce the size of the viewport itself while maintaining quality. Thus, the problem of bandwidth reduction for 360 video is a necessary and important area of study. In addition, we need to predict the viewport to help limit the bandwidth while ensuring a high quality of experience.
Related studies
According to various surveys [12], 360-degree video has become increasingly popular in recent years. Because 360-degree video has an exceptionally large capacity, reducing its size during transmission without sacrificing the quality of the user experience is a major concern. Viewport Adaptive Streaming (VAS) is one of the most recommended and popular methods for addressing this concern, and viewport estimation is an essential component of VAS systems. By distributing quality weights among the tiles with the highest probability of being viewed, viewport estimation reduces the capacity of tiles that users ignore while maintaining the quality of the tiles they pay attention to. According to [13], the quality of a VAS system is significantly affected by the incorrect viewport predictions of [14], [15], [16], [17]. Therefore, viewport prediction is a crucial step of a VAS system, and achieving high accuracy in viewport prediction is imperative
Previous methods faced two unsolved challenges. First, their accuracy was evaluated for only a short period, mostly near the beginning of the video. Second, users tend to change their viewing angle throughout the video, making it difficult to predict the viewport for the entire video. Figure 2.1 shows the change in user perspective and the difference between the predicted viewport and the viewport actually seen by the user. In the following sections, we present the proposed viewport prediction model (GLVP) in detail and compare its accuracy to that of existing viewport prediction methods, such as LAST [14], LINEAR [17], LSTM [16], and GRU [18]
Figure 2.1 The GLVP model uses the viewport positions of the past H seconds to predict the viewport in the next F seconds
The viewport adaptive streaming model has been proposed in [19], [20], [21], [22] to deal with the difficulty of the high bit rate of full 360 videos. VAS transmits the video portions visible to the user (i.e., the viewport) in high quality and the rest of the video in lower quality [19]. Most earlier studies [12] used the tiled VAS method, where the entire 360 video is spatially divided into small parts called tiles, each of which is encoded into multiple versions of different quality. High-quality versions are selected for the tiles that overlap the user's viewport, while low-quality versions are selected for the remaining tiles [22]
Viewport prediction means predicting where the user will focus their eyes in the future [12], and it is a key part of viewport adaptive streaming. Because of its simplicity, early publications [20], [14], [15], [16], [17] used linear regression and its modifications (e.g., weighted linear regression) to predict the viewport position. Recent studies have used neural networks for viewport prediction. In particular, long short-term memory (LSTM) networks [16], [25], gated recurrent units (GRU) [18], and other recurrent neural networks (RNNs) have received a lot of attention. Furthermore, probabilistic models such as Gaussian mixtures [26] and reinforcement learning algorithms [27] such as contextual bandits have been applied. In [26], the authors proposed a hybrid video and user viewport prediction method to reduce bandwidth consumption in live mobile VR streaming, which differs from our target. In [27], the authors suggested a viewport prediction algorithm and evaluated it experimentally for video streaming, but the data in that article was collected from tests displaying 360-degree videos of people, so the dataset in [28] differs from our dataset. However, all the above solutions still have limited accuracy; in this thesis, we propose a new model to improve viewport prediction accuracy
Furthermore, some other methods for viewport prediction are mentioned in [28], [29], and [30]. The authors in [31] suggested using FoV prediction and buffering to create a live streaming system for 360-degree videos. In [32], the authors developed a cluster-based viewport prediction algorithm using viewport sample data from earlier video streaming sessions. However, this method [29] is very dependent on the video content. In [29], the authors extracted video semantic information, and such deep learning-based video analysis requires powerful processing resources and large memory space.
Viewport prediction models in general demand substantial processing and memory resources. However, most client devices, such as small mobile devices or head-mounted displays (HMDs), have limited memory. In addition, most existing studies on viewport prediction are based on fixed contexts and consume a lot of memory. In contrast, our proposed viewport prediction model automatically adapts to head movement, is self-sufficient through training, and can remove unnecessary memory areas, thus consuming less memory.
Proposed viewport estimation method
Figure 2.2 The user viewport at a given time
Let p_t0 denote the viewport position at time t0. The viewport's latitude and longitude values can be used to specify the position of its center point [12]. Figure 2.2 shows a video sphere, which captures a 360-degree view of a scene and is the main content type in virtual reality, providing an immersive viewing experience. The viewport is the area of the video that the user can see at once due to the limits of human vision. The viewport prediction task is to predict the position of the viewport p_{t0+m} at a future time t0 + m, where m denotes the prediction horizon. Because 360-degree video streaming is typically performed on a per-segment basis [31], the predictor must supply a prediction for the interval [t0 + m, t0 + m + s], where s stands for the segment duration, as shown in Figure 2.3
Figure 2.3 Hypothesis posed in viewport prediction
Our proposed viewport prediction method is based on a combination of LSTM and GRU; the algorithm is called GLVP (GRU-LSTM-based Viewport Prediction). LSTM can estimate long-term correlations in data and can thus model longer-term trends; as a result, LSTM can provide more accurate viewport predictions. However, LSTM has a relatively long initial processing time. Therefore, we place a GRU block before the LSTM to speed up the processing of input data and thus improve accuracy in the first few seconds compared to an algorithm using only LSTM. Figure 2.4 shows the architecture of a cell in our proposed method
The GLVP model has n inputs corresponding to the n frames/images of a video: {x_1, x_2, ..., x_t, ..., x_n}. M_{t-1} and a_{t-1} are the cell state and hidden state at time t - 1, while M_t and a_t are the cell state and hidden state at time t. In our design, a_t represents the selection of the predicted viewport, and M_t is the data that will be used as input for the next cell. The gates used in the cell are defined as follows:
• Forget gate q_t: removes unnecessary information from the current cell
• Input gate (v_t, M̃_t): selects the important information to be used in the current cell
• Reset gate (r_t, n_t): controls how much of the previous state is retained
• Output gate (M_t, d_t, a_t): determines what information from the current cell is used as the output
The operation of the whole model is described step by step as follows:
Step 1: In the first step, q_t determines which information from the input x_t and the hidden state a_{t-1} should be removed:

q_t = σ(U_q ⨂ x_t + W_q ⨂ a_{t-1} + b_q) (2.1)

where:
q_t: the gate filtering data from the output of time step t - 1
U_q, W_q: the corresponding weight matrices of the forget gate
b_q: the bias vector of the forget gate
x_t: the input vector at time step t
a_{t-1}: the output of the cell at the previous time step t - 1
σ: the sigmoid function, which maps q_t to the range (0, 1); q_t ≈ 1 means the information is completely remembered, and q_t ≈ 0 means it is completely forgotten
Figure 2.4 GLVP model for viewport estimation
(a) Step 1: adding the r_t gate; (b) Step 2: adding the n_t gate
Figure 2.5 Operation of the reset gate
Step 2:
In the second step, the input data is not taken entirely from the input vector x_t and the previous cell output a_{t-1}, so it is necessary to select which information should be used to calculate M_t. In this step, we create an input gate (v_t, M̃_t) to select the important information to be used in the current cell:
M̃_t = tanh(U_M * x_t + W_M * a_{t-1} + b_M) (2.2)
v_t = σ(U_v * x_t + W_v * a_{t-1} + b_v) (2.3)

The parameters U_M, W_M, b_M, U_v, W_v, and b_v are analogous to those in formula (2.1). The activation function tanh maps the value to the range (-1, 1)
Step 3: The operation of the reset gate is illustrated in Figure 2.5a. Instead of creating only q_t as in GRU, we need an additional gate r_t to ensure that the effect of the previous hidden state is properly reduced. To further reduce the effect of the previous hidden state, we also add a new candidate hidden state n_t (as seen in Figure 2.5b). By adding these two new gates, we combine the strengths of both the GRU and LSTM algorithms. The reset gate (r_t, n_t) is defined by the following formulas:

r_t = σ(U_r * x_t + W_r * M_{t-1} + b_r) (2.4)
n_t = tanh(U_n * x_t + W_n * (r_t * q_t) + b_n) (2.5)
Step 4: In the fourth step, the cell state at the current time t is calculated based on the results obtained from (2.1), (2.2), (2.3), and (2.4), as follows:

M_t = q_t * M_{t-1} + v_t * M̃_t (2.6)
Step 5: In the final step, the output value a_t of the proposed cell is calculated as follows:

a_t = n_t + d_t * tanh(M_t) (2.7)

where:

d_t = σ(U_d * x_t + W_d * a_{t-1} + b_d) (2.8)
The variable d_t decides how much information to take from the memory state M_t, combining with the gate n_t to calculate the hidden state a_t at time t
The whole process of estimating viewports is summarized in the pseudocode of Algorithm 1.1
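For illustration, one time step of the cell described by equations (2.1)-(2.8) can be sketched in NumPy as follows. The toy dimensions, the weight initialization, and the cell-state update M_t = q_t * M_{t-1} + v_t * M̃_t (the standard LSTM form, which is consistent with the forget and input gates but is assumed here, since equation (2.6) is not written out above) are all assumptions of this sketch, not the thesis implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glvp_cell(x_t, a_prev, M_prev, P):
    """One step of a GLVP-style cell following equations (2.1)-(2.8).

    P is a dict of weight matrices U_*, W_* and bias vectors b_*.
    """
    q_t = sigmoid(P["Uq"] @ x_t + P["Wq"] @ a_prev + P["bq"])       # forget gate (2.1)
    M_tilde = np.tanh(P["UM"] @ x_t + P["WM"] @ a_prev + P["bM"])   # candidate state (2.2)
    v_t = sigmoid(P["Uv"] @ x_t + P["Wv"] @ a_prev + P["bv"])       # input gate (2.3)
    r_t = sigmoid(P["Ur"] @ x_t + P["Wr"] @ M_prev + P["br"])       # reset gate (2.4)
    n_t = np.tanh(P["Un"] @ x_t + P["Wn"] @ (r_t * q_t) + P["bn"])  # new hidden state (2.5)
    M_t = q_t * M_prev + v_t * M_tilde                              # cell state (2.6), assumed
    d_t = sigmoid(P["Ud"] @ x_t + P["Wd"] @ a_prev + P["bd"])       # output gate (2.8)
    a_t = n_t + d_t * np.tanh(M_t)                                  # output (2.7)
    return a_t, M_t

# Toy dimensions: 2-D input (longitude, latitude), 4-D hidden state
rng = np.random.default_rng(1)
d_in, d_h = 2, 4
P = {}
for g in "qMvrnd":
    P[f"U{g}"] = rng.standard_normal((d_h, d_in)) * 0.1
    P[f"W{g}"] = rng.standard_normal((d_h, d_h)) * 0.1
    P[f"b{g}"] = np.zeros(d_h)

a, M = np.zeros(d_h), np.zeros(d_h)
for t in range(5):  # feed five synthetic head-position samples
    a, M = glvp_cell(rng.standard_normal(d_in), a, M, P)
print(a.shape, M.shape)
```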
Figure 2.6 Viewport position #1 and #2 over time
We tested using 360-degree videos, specifically diving videos and head motion traces. Since viewport positions are provided as quaternion data, head motion traces and videos can be extracted from the dataset [32]. In our deployment, we convert viewport positions to latitude and longitude data to reduce computational complexity. We calculate longitude and latitude separately for each method under consideration and then combine them for the final assessment. We use the viewport position input dataset depicted in Figure 2.6
As seen in Figure 2.6, viewport position #1 differs from viewport position #2: longitude and latitude are more dynamic in viewport position #1 than in viewport position #2. Due to the differences between the two viewport positions, we can also assess the algorithms under numerous individual perspective changes. In our analysis, we compared GLVP with various other techniques, including GRU [17], LSTM [15], Linear [16], and Last [13]. VAS is set up based on tiling [19] in the first 6 seconds. Our testing revealed that the accuracy of all approaches is essentially the same in the later seconds, with the differences only manifesting in the first six seconds. Total denotes the set of tiles actually visible at time t, and Total_e stands for the estimated set of visible tiles. In our experiment, we use the accuracy metric to evaluate performance
Accuracy [32]: the ratio of correctly estimated visible tiles to the total number of visible tiles:

Φ = |Total ∩ Total_e| / |Total| (2.9)
Accuracy is a crucial parameter because correctly predicting the visible tiles allows the system to lower the quality of the remaining tiles, which greatly reduces the transmitted capacity without harming the user experience
The redundancy δ is calculated as:

δ = 1 − Φ (2.10)
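A minimal sketch of how these two metrics can be computed over tile sets, assuming Φ is normalized by the number of actually visible tiles (the tile grids below are invented for the example):

```python
def viewport_accuracy(visible, estimated):
    """Accuracy (2.9): fraction of actually-visible tiles that were also predicted."""
    visible, estimated = set(visible), set(estimated)
    return len(visible & estimated) / len(visible)

# Tiles identified by (row, column) indices in the tiled 360 frame
actual = {(1, 1), (1, 2), (2, 1), (2, 2)}      # tiles the user actually sees
predicted = {(1, 1), (1, 2), (2, 1), (3, 1)}   # tiles the predictor fetched

phi = viewport_accuracy(actual, predicted)
delta = 1 - phi                                # redundancy (2.10)
print(phi, delta)  # 0.75 0.25
```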
Figure 2.7 Viewport estimation performance of the reviewed methods at each early motion trace of viewport position #1
Figure 2.8 Viewport estimation performance of the reviewed methods at each early motion trace of viewport position #2
Figure 2.7 shows that GLVP is the most accurate method over all 6 seconds. All algorithms reach accuracy above 80% from the third to the sixth second, but only GLVP has accuracy above 80% for the entire 6 seconds. LSTM takes a long time to analyze the data, so it only provides a low accuracy of around 20% in the first second; from the third second onwards, its accuracy exceeds 80%. The LAST solution offers a high accuracy range of 70% to 90%. However, it is not stable over the entire 6 seconds because it only uses the previous viewport for the prediction; as a result, if the viewer changes their point of view frequently, the prediction will not be very accurate. Linear achieves an accuracy of 60% to 85% over the entire 6 seconds and is consistently less accurate than the preceding methods. This is because, while the LAST method simply takes the previous viewport position as its guess, Linear fits a linear model to the previous position data to reduce the mean squared error. The GRU approach has an accuracy range of 50% to 95%. In the first second, GRU is less accurate than LAST and LINEAR, but in the following seconds it surpasses them. The low accuracy in the initial seconds is understandable because the GRU algorithm requires some time to process the input; however, GRU starts to produce better results after a few seconds
Another instance is viewport position #2, where GLVP again outperformed LAST, LINEAR, LSTM, and GRU, as shown in Figure 2.8. This demonstrates that GLVP is effective in a variety of viewport positioning scenarios. Table 2.1 provides a performance summary of our proposed GLVP and the other algorithms, including LAST, LINEAR, LSTM, and GRU, to give a clearer quantitative picture. GLVP was 10.23%, 18.78%, 9.50%, and 19.65% more accurate than LAST, LINEAR, GRU, and LSTM, respectively, for viewport position #1. GLVP was also 10.27%, 13.93%, 10.27%, and 19.70% more accurate than LAST, LINEAR, LSTM, and GRU, respectively, for viewport position #2. We also examine the redundancy of these solutions from a different perspective. According to the findings, GLVP has less redundancy than any of the other options. Reducing redundancy lessens the negative effects of viewport prediction inaccuracy on the user experience
Table 2.1 Performance of the GLVP and reference methods under viewport positions #1 and #2
Metrics GLVP Last Linear GRU LSTM
Training time evaluation
In addition to evaluating GLVP's prediction performance, we compare the training time of the GLVP model to that of the other available options. We conducted a Python experiment to measure this parameter on a computer running 64-bit Windows 10 with 16384 MB of RAM and an Intel(R) Core(TM) i7-6500U CPU operating at 2.50 GHz (4 CPUs). The training time is shown on a time scale of seconds in Figure 2.9. LAST requires the least training time compared to GRU, LSTM, LINEAR, and our proposed GLVP.
Conclusion
In this chapter, we have proposed a new viewport prediction method called GLVP, which outperforms existing estimation methods in various scenarios. In the future, we will focus on improving GLVP by incorporating more data from a wider range of content and reducing its training time
Accurate viewport prediction is an intermediate research step that supports future bitrate adaptation solutions for streaming videos flexibly under fluctuating network conditions. Therefore, in the next chapter, we describe a method to adapt the bitrate of videos streamed over an HTTP-based network in which the network bandwidth is presumed to encounter sudden drops.
Flexible QoE optimized Video Adaptive Streaming over HTTP for
Problem formulation
According to Cisco figures [34], recent years have seen a rapid expansion of online video, which now accounts for 79% of all Internet traffic. Especially during the COVID-19 epidemic, online video conferencing, such as virtual classes and meetings, became essential for bringing people together globally. As a result, video services have recently developed greatly and thrived on a global scale. Because it requires transmitting a large amount of data across an unreliable network, video poses a substantial challenge
One of the most popular video streaming protocols in recent years is HTTP Adaptive Streaming (HAS), where HTTP stands for Hypertext Transfer Protocol. In the HAS technique, a video is first encoded into numerous variants with different levels of video quality. Each of these variants is then divided into so-called segments, which are created and kept on the back-end server. Depending on network conditions, a suitable segment of the requested video variant is sent to the client in response to each client request. Controlling the streaming system in this way can lead to serious quality changes if the network bandwidth varies dramatically throughout the streaming session. Users watching these streaming sessions may, in turn, form an unfavorable impression of the service as a whole (low quality of experience). As a result, a QoE model should be used to determine which variant the system should adapt to when steering such an online video system. The system strives to provide customers with the best QoE possible by using a QoE score determined by its users
To provide the most versatile adaptive method for various network conditions, we use the QoE model to select among candidate versions according to current client conditions. Client conditions are characterized by the buffer level and the instantaneous throughput, allowing each client machine to make the best decisions for itself.
Although several adaptive streaming studies have recently been published, such as [35]-[38], they choose segment versions only heuristically. In [35], [36], and [38], the version choice depends on the network bandwidth and the client's buffer state at a given time. To the best of our knowledge, study [37] is the first to use a genuine QoE model to determine the adaptation version. However, because decisions are always made for two segments at once, the resources for those two segments are committed in advance. This diminishes the solution's capacity to respond to bandwidth changes affecting both the present and future segments, and the situation can worsen if the bandwidth dips suddenly.
In addition, another strategy, based on the Scalable Video Coding (SVC) technique, improves the adaptability of HAS [36]. In [55], the authors took two steps, loading the segment and then smoothing it, to increase the quality of the user experience. Although difficult to construct, this technique significantly enhances QoE results when a Finite State Machine (FSM) is implemented. The method produces a decision in around 167 milliseconds on a 2.5 GHz CPU core; this high computational complexity is mostly caused by how long the system takes to stabilize after downloading.
To address issues with the current methods, the number of subsequent segments whose versions are chosen adaptively is determined based on the decrease in the measured bandwidth of the four preceding segments. In particular, when only one subsequent segment is taken into account, the chosen version tends to be the highest feasible one. As a result, the buffer level drops, which eventually leads to severe version degradation.
Based on these observations, in this paper we propose a client-side rate adaptation strategy, named ABRA, to improve the QoE perceived by users. This work is an improved version of the adaptation algorithm previously proposed in [54]. ABRA distinguishes three levels of the client's buffer and two distinct scenarios of change in the throughput measured at the client side.
In total, this yields six possible throughput-and-buffer combinations. For each of the six scenarios, we define one of five possible actions. The five strategies include lowering the version to the lowest quality level to protect the buffer, keeping the version at the same quality level as the previous one to minimize the negative effect of quality changes on the user's perceived QoE, and projecting the versions of the next two or three segments based on the impact of the combination of consecutive versions on the QoE value. By considering more than one upcoming segment, ABRA avoids the problem of always selecting the highest possible version, which would cause a drop in the buffer level and a subsequent considerable version degradation.
The remainder of this paper is structured as follows: Section II reviews the state of the art, Section III elaborates on our proposed adaptive streaming algorithm, ABRA, Section IV discusses ABRA's performance results from multiple experiments and aspects, and Section V presents our conclusions.
Related work
Recently, many adaptation algorithms have been proposed for improving the service quality perceived by clients (i.e., the Quality of Experience, QoE). In fact, it is hard to draw clear boundaries between those solutions; however, we can categorize them into three main directions: buffer-based, throughput-based, and mixed (i.e., hybrid buffer- and throughput-based) algorithms [39]. Mixed adaptation combines external (bandwidth) and internal (buffer, segment size) elements of the client to compute the bitrate of the next segment.
In throughput-based methods, the client-side throughput for the next segment is estimated based on the previously monitored throughput, which can be computed by dividing the size of the previously downloaded segment by the time required to fetch it. The most appropriate version for the next segment is then chosen based on the estimated throughput. One of the initial studies in the throughput-based direction is the Aggressive solution [35], which has a very simple principle: throughput is estimated as equal to the throughput of the previous segment. The scheme then selects the video version with the highest possible quality whose bitrate does not exceed the estimated throughput, in order to avoid rebuffering. However, this estimation method is often inaccurate when network bandwidth fluctuates significantly, making Aggressive quite sensitive to bandwidth variations. This intolerance of bandwidth fluctuation results in severe quality variation, which adversely affects the QoE perceived by clients. To solve this challenge, enhanced solutions were proposed later, such as [40], [41], which apply a safety margin in the throughput estimation, or [42], which uses the average throughput calculated over multiple previous segments. In [47], the authors focus on enhancing the viewer experience using a receiver-driven strategy that copes with variable TCP flow throughput. With this method, the first few seconds of a video are always downloaded at the lowest quality, since it always selects the smallest representation for the first section. However, the majority of throughput-based solutions rely on moving-average models or harmonic-mean network capacity estimation, which may not adequately capture the wide range of network bandwidth changes and fail to take into account the time relevance of the various samples.
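The estimators discussed above can be sketched as follows. Function names and window sizes are illustrative, not taken from the cited works.

```python
def estimate_last_sample(samples, margin=0.0):
    """Aggressive-style estimate: the previous segment's throughput,
    optionally reduced by a safety margin (as in [40], [41])."""
    return samples[-1] * (1.0 - margin)

def estimate_moving_average(samples, window=5):
    """Arithmetic mean of the last `window` samples (as in [42])."""
    recent = samples[-window:]
    return sum(recent) / len(recent)

def estimate_harmonic_mean(samples, window=5):
    """Harmonic mean of the last `window` samples; damps the effect of
    short throughput spikes compared to the arithmetic mean."""
    recent = samples[-window:]
    return len(recent) / sum(1.0 / s for s in recent)
```

Note how the harmonic mean of the samples [4000, 2000] is about 2667 Kbps, below their arithmetic mean of 3000 Kbps: a brief spike pulls the estimate up less, which is exactly why it is popular for capacity estimation.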
In the direction of buffer-based schemes, the current and previous buffer statuses are the primary factors in deciding the video version for the next segment, as in [43], [45]. In this type of solution, the client-side playback buffer is typically divided into multiple ranges, and within each range a suitable version can be determined by different actions. In general, when the buffer is in very good condition (i.e., in a high buffer range), the version chosen for the next segment should be higher than that of the current segment. When the buffer is in the middle range, these schemes tend to keep the version stable. On the contrary, when the buffer is at a low level, the version for the next segment is decreased to the lowest level to avoid rebuffering. In [43], the authors proposed considering only buffer conditions for video streaming adaptation, so that capacity estimation is not needed. In [44], a buffer-based adaptation logic coordinated with client metrics was proposed to compensate for errors in video adaptation decisions; these errors arise because the network information available at clients is insufficient, especially when multiple clients compete through a bottleneck. The authors of [45] proposed BOLA, which utilizes a Lyapunov optimization model to consider only buffer occupancy observations. BOLA achieves near-optimal utility, and in many cases significantly higher utility than the state of the art, such as MPC, PANDA, ELASTIC, and Pensieve. But if the selected bitrate does not match the available bandwidth, BOLA takes a long time to converge. The issue of unoptimized parameters left pending in [43] was then solved by Oboe [69], which overcame the limitations of BOLA by using the buffer level to estimate capacity. Research [70] indicated that estimating capacity is not necessary in the steady state, but is quite important during the startup phase because the buffer grows from empty. So the solution in [70], BB, decides video rates based on the current buffer occupancy and applies simple capacity estimation only while the buffer grows from empty. By doing so, [70] can reduce the rebuffering rate by 10-20% compared with the default ABR algorithm of Netflix, while achieving higher video rates in the steady state. However, BB becomes unsuitable when the video quality changes continuously, as it tends to generate a large number of version switches that badly affect the user's quality of experience.
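A generic sketch of the buffer-range rule common to these schemes is given below. The thresholds and the one-step version change are illustrative, not the exact values or rules of any cited work.

```python
def buffer_based_next_version(buffer_level, current_version, n_versions,
                              b_low=10.0, b_high=25.0):
    """Generic buffer-range rule: raise the version by one step in the high
    range, hold it in the middle range, and drop to the lowest version in
    the low range to avoid rebuffering. Thresholds are in seconds of
    buffered video and are illustrative only."""
    if buffer_level >= b_high:
        return min(current_version + 1, n_versions)
    if buffer_level > b_low:
        return current_version
    return 1  # lowest version to protect against rebuffering
```

With a 30-second buffer the sketch steps the version up by one; with a nearly empty buffer it falls back to version 1 regardless of the current version.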
When using mixed (or hybrid) algorithms, every choice a client makes is based on the current throughput and buffer occupancy, as well as other factors such as segment sizes and user-perceived QoE. Mixed algorithms exploit the strengths of both buffer-based and throughput-based approaches: buffer-based rules help prevent rebuffering, whereas throughput-based rules help select good bitrates to boost video quality. In the direction of mixed algorithms, several works can be found in [48], [49], [50], [51], [63]; however, these solutions do not use a QoE model for adaptation decisions. Work [48] considers the degradation of DASH performance caused by the rate control loops of DASH and TCP and proposes SQUAD to deal with the issue; SQUAD resolves the discrepancy between DASH bandwidth estimation at the application layer and the rate estimation of the underlying transport protocol. Research [49] introduces a new Adaptation Buffer Management Algorithm, called ABMA+. In principle, ABMA+ makes adaptation decisions based on the predicted rebuffering probability, using a pre-computed buffer map to avoid heavy computation on the fly. One of the popular ABR approaches is the fuzzy-based algorithms in [57], [59]. Akshan et al. in [59] used the moving average of the playback buffer level variations and the observed throughput to minimize video rate switches. Since existing ABR algorithms use fixed control laws and are designed with predefined client/server settings [57], those solutions fail to reach optimal performance for different video client settings and QoE objectives. In [57], the authors solved this problem by proposing a buffer- and segment-aware fuzzy ABR algorithm that chooses rates for upcoming video fragments based on the segment duration and the client's buffer size, in addition to the throughput and playback buffer level. ARBITER+ [63] employs a combination of a proportional-integral controller and a harmonic network throughput estimator to determine the next representation quality. In this category, MPC [62] uses model predictive control, combining buffer occupancy and throughput information; this algorithm optimizes a comprehensive set of metrics, including video quality, bitrate, and buffer occupancy. The bitrate for the current segment is chosen based on a prediction of network bandwidth for the next few segments; hence, prediction accuracy has a huge impact on MPC's performance. Moreover, MPC requires computing the optimization offline, on a server, for all possible contexts. Similar to BB, although MPC can reach quite high average bitrate quality, it is unsuitable when the video quality changes continuously, as it tends to cause more stalling in that case.
Also among the approaches that consider both throughput and buffer conditions is a subgroup of learning-based algorithms; however, this is a different direction from QoE-model-based approaches. Another line of work also uses QoE in adaptive algorithms, as our paper does, but in a different way: QoE serves as the value function of a Reinforcement Learning (RL) process to improve traditional algorithms [64], [65]. In [64], the authors proposed Pensieve, which uses RL for making ABR decisions. The scheme utilizes a neural network that selects the bitrate for the next video chunks based on performance observations collected at the player from past decisions. In [65], the authors presented a QoE-oriented DASH framework in which an RL-based ABR algorithm is embedded. This scheme achieves better visual and temporal QoE factors while ensuring application-level fairness among multiple clients competing through a bottleneck. Besides, HotDASH [67] is another method that uses reinforcement learning to improve QoE and bitrate by prefetching video segments.
Figure 3.1 Process of Content Preparation at the Streaming Server and Client
Some other adaptive algorithms that combine throughput and buffer with non-QoE parameters are described in [56], [58], [62], [68]. The authors of [68] presented a hybrid algorithm named DYNAMIC, built on the DASH reference player. In this scheme, BOLA is used in a buffer-based manner when the buffer is high, and a throughput rule is used in a throughput-based manner when the buffer is low or empty. Work [34] considers the buffer level and its variations to mitigate playback interruptions based on fuzzy DASH adaptation algorithms.
From another perspective, in the direction of mixed algorithms that take the QoE model into account, we can find several works such as [46] and studies [36]-[38], [54]. The authors of [46] propose using game theory to allocate resources to improve QoE for multiple users.
Research [36] provides SARA, an adaptation algorithm that uses the buffer status, the estimated throughput, and segment sizes to select the version of the next segment. Based on those metrics, the most appropriately sized version for the client's current state is chosen. However, strong network bandwidth fluctuation can cause the selected versions to change frequently, resulting in degradation of viewers' perceived service quality. Moreover, both [38] and SARA [36] estimate a version for only one next segment, leading to optimization for an instant in time but not for the whole streaming session. As a remedy, work [37] proposes an adaptation algorithm that selects versions suitable for the next two segments; however, fixing the estimation at 2 segments means that [37] does not work well when there is a sharp bandwidth drop. Work [54] introduces a new adaptive streaming algorithm based on the throughput status, the buffer level, and the QoE perceived by users. This method considers more upcoming segments in order to produce more reliable and higher versions and thereby enhance QoE. Taking the next two segments into account typically results in higher selected versions, but less stable ones, compared to considering the next three segments. To provide consistent QoE when throughput varies dramatically, the method takes the next three segments into account when making adaptation decisions; in contrast, only the next two segments are considered when the bandwidth is stable. As a medium-buffer adaptation algorithm, this solution is ineffective in high- or low-buffer scenarios. To address this issue, we propose an upgraded version, called ABRA (All-Buffer-Range Adaptation), that works well across all buffer ranges. Under the same throughput conditions, the ABRA algorithm slightly reduces video bitrates but increases QoE scores by 10% and reduces the number of stalling events by 3 to 4 times compared to the MBA algorithm.
Proposed adaptive streaming algorithm - ABRA
In this section, the overall adaptation architecture between the server and the client is illustrated in Figure 3.1.
• At the server: the video is encoded and segmented into equal-length segments, each with multiple quality versions. Information about these components is stored in the Media Presentation Description (MPD) file.
• At the client: the server sends the MPD file to the client. Based on the information obtained from the MPD and the client's estimated data (e.g., throughput, buffer, QoE), the rate adaptation algorithm chooses the versions for the next 1, 2, or 3 segments. The downloader then requests the segments and downloads them to the client, where they are buffered, decoded, and displayed on the user's screen.
In this section, we present an adaptation algorithm designed to work appropriately with all buffer sizes (i.e., low, medium, and high); to that end, the scheme is called All-Buffer-Range Adaptation (ABRA). The ABRA scheme uses the following notation:
• 𝜙: the length of each segment, in seconds
• N: the number of encoded versions of different bitrates for each segment, where a higher version index corresponds to better video quality
• At the client, downloaded segments are placed in the playback buffer to await their playtime
To decide on appropriate versions for the segments, we divide the buffer into three ranges: dangerous, low, and high, based on three thresholds, B_min, B_low, and B_high, as described in Figure 3.2. These thresholds are defined in terms of video duration, i.e., the number of seconds of video contained in the buffer.
Figure 3.2 Three divided buffer ranges
To make good adaptation decisions under fluctuating throughput, our ABRA algorithm differentiates between two main variation cases: the downtrend case and the uptrend case. The downtrend case occurs when the measured throughput of the previous segment is equal to or greater than the current throughput; otherwise, it is an uptrend case.
In ABRA, we also consider two special throughput cases: a sharp throughput drop and a rapid throughput rise. We evaluate these two conditions based on specific buffer statuses as well.
The goal of ABRA is to select appropriate versions for the next segments based on each specific throughput case and buffer level, so as to maximize the overall QoE score of the streaming session. Each decision involves a trade-off between decreasing buffer occupancy and increasing segment versions, while avoiding playback interruptions (rebuffering events).
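The four throughput cases described above might be classified as in the following sketch. The sharp-drop test follows the threshold rule given later in this section; the rapid-rise test mirrors it, since the text does not define that case formally, and both threshold values are placeholders.

```python
def classify_throughput(history, delta_t_drop, delta_t_rise):
    """Classify the throughput situation for the current segment i, given a
    list of at least 4 measurements (..., T_{i-3}, T_{i-2}, T_{i-1}, T_i).
    Downtrend: T_{i-1} >= T_i; sharp drop: max of the 3 previous samples
    exceeds T_i by more than delta_t_drop. The rapid-rise rule is a mirrored
    assumption, not taken from the original text."""
    t_cur = history[-1]
    trend = "downtrend" if history[-2] >= t_cur else "uptrend"
    if trend == "downtrend" and max(history[-4:-1]) - t_cur > delta_t_drop:
        return "sharp_drop"
    if trend == "uptrend" and t_cur - min(history[-4:-1]) > delta_t_rise:
        return "rapid_rise"
    return trend
```

For instance, measurements [5, 5, 5, 2] with a drop threshold of 2 classify as a sharp drop, while a flat history classifies as a (mild) downtrend because equality counts as downtrend.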
• At a specific time, version V_{i+1} is selected for the next segment i+1 based on the client's current buffer B^cur_i as well as its throughput T_i
• Then, for each candidate version V_{i+1} satisfying N ≥ V_{i+1} ≥ 1, the corresponding estimated buffer level B^e_{i+1,V_{i+1}} and throughput T^e can be calculated
The corresponding throughput T^e is calculated as follows:
• margin: a parameter to reduce the adverse influence of throughput estimation errors
The corresponding buffer level B^e_{i+1,V_{i+1}} is calculated as follows:
• R_{i+1,V_{i+1}}: the bitrate of version V_{i+1} estimated for segment i+1
• the download time: the amount of time needed to completely download version V_{i+1} of segment i+1, derived from R_{i+1,V_{i+1}} and T^e
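Concretely, a reconstruction of these two estimates, consistent with the definitions above (segment length 𝜙, bitrate R, and the margin parameter), would be the following; the exact formulas in the original source may differ slightly:

```latex
T^{e} = (1 - \mathit{margin}) \cdot T_i ,
\qquad
B^{e}_{i+1,V_{i+1}} = B^{cur}_{i} + \phi - \frac{R_{i+1,V_{i+1}} \cdot \phi}{T^{e}}
```

Here the last term is the download time of segment i+1 (its size R_{i+1,V_{i+1}} · 𝜙 divided by the estimated throughput), while the segment adds 𝜙 seconds of playable video to the buffer once downloaded.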
In addition, in this paper we use the QoE model proposed in [18] to calculate the QoE score corresponding to version V_{i+1}, based on its quality level Q_{V_{i+1}}. This QoE model accounts for almost all parameters that affect QoE when streaming video over HTTP, including quality values, quality switching types, and interruptions.
QoE_pred = Q_PQ - D_IR - D_ID    (2.3)

where:
• QoE_pred: overall QoE considering the influence of initial delay, interruptions, and varying perceptual quality
• Q_PQ: varying perceptual quality of a session, depending on the corresponding quality switching and quality values
• D_IR: distortion function of the interruptions
• D_ID: distortion function of the initial delay
This QoE model can forecast the QoE that users will experience at any time during a streaming session. Based on the buffer levels, the throughput fluctuations, and the QoE scores associated with the candidate segment versions, ABRA determines suitable versions for the following segments. The ABRA algorithm continually measures QoE scores for every second of video viewing. Note that any existing QoE model can be utilized instead, after carefully examining its performance and accuracy.
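A minimal sketch of how the model of Eq. (2.3) can be used to rank candidate versions is shown below. The three inputs would come from the model in [18]; here they are plain numbers, and the candidate table is hypothetical.

```python
def qoe_pred(q_pq, d_ir, d_id):
    """Eq. (2.3): perceptual quality minus the distortions caused by
    interruptions (D_IR) and by the initial delay (D_ID)."""
    return q_pq - d_ir - d_id

def best_version(candidates):
    """Pick the candidate version with the highest predicted QoE.
    `candidates` maps a version index to its (Q_PQ, D_IR, D_ID) tuple."""
    return max(candidates, key=lambda v: qoe_pred(*candidates[v]))
```

For example, a higher-quality candidate whose interruption distortion outweighs its perceptual-quality gain loses to a lower, interruption-free one.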
This solution is shown to work well at all buffer levels (low, medium, and high); therefore, we name the algorithm All-Buffer-Range Adaptation (ABRA).
Algorithm 2.1 (fragment): selecting versions for the 3 next segments
6  Compute QoE_{i+3} by the QoE model proposed in [17]
7  if (QoE_{i+3} > QoE_max) and (B^e_{i+1,v_1} > B_min) and (B^e_{i+2,v_2} > B_min + ΔB_err) and (B^e_{i+3,v_3} > B_min + ΔB_err) then
     QoE_max ← QoE_{i+3}
Algorithm 2.2: All Buffer Range Adaptation - ABRA
1   if T_i ≤ T_{i-1} then                      // downtrend case
2     if B^cur_i ≤ B_min then                  // in dangerous range
        V_{i+1} ← 1                            // lowest version
4     if V_{i+1} was decided and |B^cur_i - B^e_{i,v_i}| ≤ ΔB_err then
        keep using V_{i+1}                     // which is V_{i+2} in the previous decision
6     if B^cur_i > B_low then                  // in high (safe) range
        select versions for the 3 next segments by Algorithm 2.1
8     if max(T_{i-1}, T_{i-2}, T_{i-3}) - T_i > ΔT_drop then   // sharp throughput drop
        select versions for the 3 next segments by Algorithm 2.1
      else                                     // select versions for the 2 next segments
        initiate: V_{i+1} ← 1, V_{i+2} ← 1, QoE_max ← 0
        for v_1 ← 1, 2, ..., V_i do
          for v_2 ← 1, 2, ..., v_1 do
            compute the overall quality QoE_{i+2} by (1)
            if (QoE_{i+2} > QoE_max) and (B^e_{i+1,v_1} > B_min) and (B^e_{i+2,v_2} > B_min + ΔB_err) then
              QoE_max ← QoE_{i+2}
14  if B^cur_i ≤ B_min then                    // in dangerous range (uptrend case)
      initiate: V_{i+1} ← V_i, QoE_max ← 0
      for v_1 ← V_i, V_i + 1, ..., N do
        compute the overall quality QoE_{i+1} by (1)
        if (QoE_{i+1} > QoE_max) and ...
ABRA is an enhanced version of [54]. ABRA flexibly calculates adapted versions for either the next 2 or the next 3 segments: if the throughput decreases strongly, ABRA calculates adapted versions for the next 3 segments, otherwise for the next 2. Work [54] focuses on the medium-buffer condition and therefore does not work very well in the low- and high-buffer conditions. With ABRA, when the buffer is low and throughput increases strongly, ABRA keeps the same version; when the buffer is high and throughput decreases strongly, ABRA calculates new adapted versions for the next 3 segments. With this strategy, ABRA works well across all three buffer ranges: low, medium, and high. In comparison with our previous work [54], ABRA is shown to outperform it in the low- and high-buffer conditions.
ABRA anticipates versions for the next 1, 2, or 3 segments. When only 1 subsequent segment is computed, the highest-quality version tends to be chosen, which can make the quality of later versions less stable. To get around this, we consider the following 2 or 3 segments; as a result, the version selection is more stable, although the maximum version quality is constrained.
The stability of future versions improves as the number of segments under examination grows. We consider 1 segment when increasing quality as throughput rises, 3 segments when stability is paramount (e.g., a sharp drop in bandwidth), and 2 segments in the remaining cases. Deciding between 2 or 3 segments for prediction helps use network resources more effectively and flexibly; resource utilization is therefore more efficient than when considering only 1 segment.
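The depth-selection rule just described can be summarized in a small sketch; the case labels are illustrative names, not identifiers from the original algorithm.

```python
def lookahead_depth(trend):
    """Number of future segments ABRA evaluates, per the rule above:
    1 when throughput is rising (push quality up), 3 on a sharp bandwidth
    drop (favor stability), and 2 in the remaining cases."""
    if trend == "uptrend":
        return 1
    if trend == "sharp_drop":
        return 3
    return 2
```

Keeping this mapping explicit makes the trade-off visible: deeper lookahead buys stability at the cost of capping the achievable quality for the planned segments.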
Below is a description of how a version combination is chosen when the system considers the 3 next segments; a candidate is accepted when:
• the estimated QoE after segment i+3, QoE_{i+3}, is greater than QoE_max
• the estimated buffer after segment i+1, given that version V_{i+1} is chosen, is greater than B_min
• the estimated buffer after segment i+2, given that version V_{i+2} is chosen, is greater than B_min plus ΔB_err, where ΔB_err is a buffer margin that guards against deviation between the actual and estimated bandwidth; in our experiments this margin is set to 2 seconds
• similarly, the estimated buffer after segment i+3, given that version V_{i+3} is chosen, is greater than the minimum buffer B_min plus the margin ΔB_err
Calculating for 1 or 2 segments is similar to the above case. However, considering only 1 segment differs in terms of the selected version as follows:
• Select the next version from {V_{i+1} | V_i ≤ V_{i+1} ≤ 9}, where 9 is the maximum version index
• In the downtrend case, where T_i ≤ T_{i-1}, ABRA operates as follows:
• When the current buffer B^cur_i is in the dangerous range (i.e., B^cur_i ≤ B_min), ABRA selects the lowest version to avoid interruptions (i.e., V_{i+1} = 1)
• If the current buffer is in the high range (i.e., B^cur_i ≥ B_low), ABRA calculates versions for the 3 next segments, aiming either to reduce to a lower-quality version if possible or to keep the current video quality version
• If the current buffer is in the low range (i.e., B_min < B^cur_i < B_low), ABRA predicts the versions for either 2 or 3 segments, depending on how the bandwidth decreases: if the bandwidth encounters a sharp drop, the prediction covers 3 segments, otherwise 2 segments
Since the number of next segments is selected based on the real-time variation in throughput, especially when the network encounters a sharp throughput drop, the decision on which versions to use for the next three segments is also made with the goal of keeping the video quality mostly stable over the whole streaming session. Otherwise, users would perceive the service poorly due to constantly fluctuating quality.
ABRA uses Algorithm 2.1 to compute versions for the next three segments, deciding whether to keep the current version to maintain stability or to decrease the video quality version under bad bandwidth conditions.
Otherwise, version prediction for the next 2 segments is carried out, based on the degree of throughput variation. To quantify this degree, we define a throughput difference threshold, ΔT_drop: throughput degradation is considered a sharp drop if the difference between the maximum throughput measured over the 3 previous segments and the current throughput is greater than ΔT_drop. Essentially, the goal of selecting versions for the next segments is to maximize the QoE at the last adapted segment while preventing the buffer level from dropping into the dangerous range. In ABRA, the estimated buffer levels of segments i+2 and i+3 are calculated as follows:
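Assuming the same form as the single-segment buffer estimate given earlier, a hedged reconstruction of these two estimates is:

```latex
B^{e}_{i+2,V_{i+2}} = B^{e}_{i+1,V_{i+1}} + \phi - \frac{R_{i+2,V_{i+2}} \cdot \phi}{T^{e}} ,
\qquad
B^{e}_{i+3,V_{i+3}} = B^{e}_{i+2,V_{i+2}} + \phi - \frac{R_{i+3,V_{i+3}} \cdot \phi}{T^{e}}
```

That is, each lookahead step starts from the previous estimated buffer level, adds the 𝜙 seconds of video the segment contributes, and subtracts its estimated download time.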
In the uptrend case, adaptation decisions are made under the following conditions. If the buffer level B^cur_i falls within the dangerous range, then, as in the downtrend case, the video quality is switched to the lowest version. However, if the throughput rises strongly again (a rapid throughput rise), the version of the previous segment is kept for this segment.
Experimental results
To evaluate the performance of the ABRA solution, we first compare the MBA solution previously proposed in [54] with 4 cutting-edge solutions: Aggressive [35], SARA [36], Tran's [37], and SATE [38]. We then show the performance of MBA in comparison with ABRA, its enhanced version.
In our experiments, we set up a testbed comprising:
• The IP network between the server and the client, emulated by the DummyNet tool, in which the throughput can be varied
The buffer thresholds are set as follows:
• the margin parameter is set to 0.2
Bandwidth fluctuation is emulated using 2 trace files from [31]. At the server side, we use a 180-second video extracted from the Big Buck Bunny film [29]. The video is partitioned into 2-second segments (i.e., 𝜙 = 2 seconds), each of which is then encoded into 9 different versions corresponding to 9 quantization parameters (QPs), as illustrated in Table 3.1. The encoding uses Variable Bitrate (VBR) mode. These 9 versions of each segment are stored on the server, ready for the adaptive streaming process. On the client side, the ABRA algorithm calculates and decides which versions should be downloaded for each segment, based on the buffer and network conditions, as explained in the previous section.
Table 3.1: Definition of video quality versions
Version QP Average bitrate (Kbps)
As mentioned above, we apply the QoE model proposed in [50] to evaluate the effectiveness of our ABRA algorithm against the other existing solutions.
In this section, we first evaluate the performance of the so-called MBA method (i.e., the Medium-Buffer Adaptation algorithm) proposed in a recent work [54]. The MBA method was proven to outperform several state-of-the-art solutions, namely Aggressive [35], SARA [36], Tran's [37], and SATE [38]. MBA solves several problems: the low QoE scores achieved during throughput fluctuation in Aggressive [35]; the buffer drop-down when the bandwidth is insufficient in SARA [36]; and the occasional significant degradation in QoE scores caused by attempts to keep the highest version for a long period in SATE [38] and Tran's [37].
In summary, MBA outperforms the other four reference solutions in terms of QoE stability throughout the streaming session and overall QoE score. MBA achieves these benefits by flexibly selecting the number of segments while maintaining a minimum secured buffer level for the worst case. Moreover, since the number of predicted segments is determined for the sake of good QoE, the selected versions end up evenly spaced at close intervals, which creates a smooth video with a high QoE score (i.e., high user perception).
As an upgraded version of MBA, ABRA inherits all of MBA's advantages while improving performance across all buffer level ranges. In this section, we evaluate ABRA's performance by directly comparing it to MBA in the following aspects:
• and (3) the selected version in full session
Note that the playout session is assumed to start after the buffer is full. We also use other metrics to test the performance of our streaming algorithms, including the average received quality rate (r_av) in Kbps, the number of freeze-free sessions (n_ff), the number of stalls (n_f), the total stall duration (t_f) in seconds, the number of switches (n_sw), and the switching level. Figures 3.3 and 3.4 show the comparison between ABRA and MBA.
(a) QoE scores and corresponding throughput
(b) Landscape of selected segment versions
Figure 3.3: Adaptation performance of ABRA vs MBA with bandwidth trace #1
(a) QoE scores and corresponding throughput
(b) Landscape of selected segment versions
Figure 3.4: MBA in terms of QoE, version and buffer in two different bandwidth traces
The two direct comparisons between MBA and ABRA in Figures 3.3 and 3.4 show that the differences in the QoE values perceived by users reach MOS scores of 1.29 and 1.77, respectively. This QoE disparity occurs when the throughput drops dramatically, highlighting the difference between the two algorithms.
As can be seen, the QoE score of MBA is slightly higher than that of ABRA before the throughput drops (i.e., the "throughput attenuation" event). However, MBA's good performance is only temporary, lasting a very short period, whereas ABRA optimizes quality for the entire streaming session.
As Figure 3.4 illustrates, with both algorithms the version sometimes drops and immediately comes back right afterward. This causes the version to increase and decrease continuously within a short period, leading to QoE degradation. The phenomenon occurs more frequently with ABRA than with MBA (e.g., at segments 43, 67, 116, and 117). However, ABRA's version selection is more stable overall than MBA's; therefore, the overall quality perceived by clients (i.e., the QoE score) is better with ABRA.
Figure 3.6: Number of version switches
3.4.3 ABRA versus other existing methods
As described in the previous section, ABRA shows a slight improvement over its predecessor, MBA. In this section, we compare ABRA with state-of-the-art solutions, including MPC [40], Pensieve [42], and buffer-based (BB) [48], under real network conditions. The average bitrate, the number of switches, the stalling time, and the total QoE score are shown in Figures 3.5, 3.6, 3.7, and 3.8, respectively.
Figure 3.7: Number of Time Stallings
In general, as Figure 3.5 shows, ABRA provides an average bitrate that is 10-20% lower than the existing solutions. However, as the version-switching behavior in Figure 3.6 shows, ABRA is the solution that achieves the most stability in deciding versions for the next segments. The ABRA algorithm has about 2, 3, and 5 times fewer version switches than the BB, MPC, and RL algorithms, respectively. This helps keep the user experience stable, avoiding the annoyance caused by the frequent video quality changes of the other methods. ABRA also has a low rebuffering time, comparable to the buffer-based solution, with approximately 2.5, 4, and 1.5 times fewer rebuffering events than the BB, MPC, and RL algorithms, respectively. Thanks to these two factors, as Figure 3.8 shows, ABRA achieves the highest QoE score, about 17.55%, 20.41%, and 7.86% higher than the BB, MPC, and RL algorithms, respectively.
Additionally, ABRA consistently maintains a buffer level above 20 seconds, a relatively safe level that helps prevent stalling, which would freeze the video and negatively affect the user experience. Unlike buffer-based methods, which keep the buffer size at a constant level, ABRA dynamically adjusts the buffer level to avoid depletion when throughput drops. This prevents the most significant disadvantage of buffer-based systems: excessive version changes between segments due to buffering concerns. In our opinion, consumers value steady video quality more than a stable buffer size, because users only perceive a difference when the buffer is empty, i.e., when stalling occurs. Thanks to the two characteristics mentioned earlier, ABRA achieves the highest QoE level, as shown in Figure 3.8. Although MPC always decides for the best video quality level, this causes the buffer to fall below safe levels and triggers a lot of rebuffering, which prolongs user wait times and makes its QoE the worst. Besides, Pensieve, a solution based on reinforcement learning, strikes a good balance between improving video quality and maintaining stable buffer levels. However, its adaptation to the network data does not provide a good user experience, as there are too many version changes between segments.
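A buffer safeguard of this kind can be sketched as a simple rule: choose the highest bitrate whose estimated download time still leaves the buffer above a safety threshold after the segment is appended. This is only an illustrative approximation of such a policy, not ABRA's actual decision logic:

```python
def select_version(bitrates, throughput, buffer_level, seg_dur=4.0, safe_buffer=20.0):
    """Pick the highest bitrate whose download keeps the buffer above the
    safety level. After downloading one segment:
        buffer' = buffer - download_time + seg_dur
    Falls back to the lowest bitrate if no version is safe.
    (Illustrative rule only; the thresholds are assumptions.)"""
    for rate in sorted(bitrates, reverse=True):
        download_time = rate * seg_dur / throughput
        if buffer_level - download_time + seg_dur >= safe_buffer:
            return rate
    return min(bitrates)

ladder = [1.0, 2.5, 5.0, 8.0]  # Mbps bitrate ladder
print(select_version(ladder, throughput=6.0, buffer_level=25.0))  # -> 8.0
print(select_version(ladder, throughput=3.0, buffer_level=20.5))  # -> 2.5
```

With a healthy buffer the rule picks the top version, but near the safety threshold it downgrades rather than risk depletion, which is the behavior the paragraph above attributes to ABRA.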
Therefore, we conclude that ABRA optimizes the trade-off between image quality and a safe buffer level to achieve the best user experience among the existing solutions.
Conclusions and future work
In this research, we have proposed a QoE-driven video adaptation method over HTTP named ABRA. ABRA can flexibly select versions by adapting to bandwidth fluctuations based on throughput variations and the client's status. ABRA's advantage is that it works stably across all ranges of buffer-level status, keeping a high QoE score while maintaining stability for an extended period. This makes ABRA stand out from the existing state-of-the-art adaptive streaming schemes. In future work, we plan to conduct additional experiments in more diverse bandwidth scenarios to gain deeper insight into ABRA's performance and to improve the solution by addressing its rough edges.
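The throughput-driven side of such an adaptation loop typically smooths per-segment throughput samples before deciding, so a single measurement spike does not trigger an unnecessary version switch. A minimal sketch, assuming an exponentially weighted moving average (the estimator choice is ours, not necessarily ABRA's):

```python
def ewma_throughput(samples, alpha=0.8):
    """Smooth per-segment throughput samples (Mbps) with an exponentially
    weighted moving average; alpha controls how much history is retained.
    A single outlier moves the estimate only slightly."""
    estimate = samples[0]
    for s in samples[1:]:
        estimate = alpha * estimate + (1 - alpha) * s
    return estimate

# One 20 Mbps spike in an otherwise 5 Mbps trace barely moves the estimate
print(round(ewma_throughput([5.0, 5.0, 20.0, 5.0]), 2))  # -> 7.4
```

Feeding a smoothed estimate like this into the version-selection rule is one common way to keep decisions stable under fluctuating bandwidth.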
We have seen an increase in video streaming over the Internet, particularly during the COVID-19 pandemic, which may exceed network resource availability. In addition to upgrading network infrastructure, finding a way to intelligently adapt the streaming system to network and user conditions to satisfy clients' perception is critical. This thesis proposes ABRA, a new QoE-aware adaptive streaming scheme over HTTP that makes flexible adaptations based on network and client status. Furthermore, we propose a technique that can keep the buffer at an elevated average level for more than ten seconds, thereby limiting rebuffering caused by unexpected and unpredictable bandwidth changes. Even when the average bitrate drops, the algorithm maintains the quality of subsequent versions at a constant level, increasing the QoE. Experimental results show that our method improves QoE by 7.86% to 20.41% compared to state-of-the-art methods. Compared to existing solutions, ABRA provides a better QoE score in all buffer conditions while maintaining a minimum secured buffer level for the worst case.
This master's thesis offers an overview of virtual reality and its main research issues. It then goes deeper into the problem of viewport prediction and proposes a new viewport prediction method with better results than recent methods such as GRU [18], LSTM [16], Linear [17], and Last [14]. In addition, a new video adaptation algorithm called ABRA has been presented, which gives better average QoE results than MPC [40], Pensieve [42], and Buffer-based [48]. Viewport prediction and adaptive 360-degree video streaming are indispensable trends that are poised to play a pivotal role in the widespread adoption of virtual reality in the future.
References
[1] Boas, Y. A. G. V., "Overview of virtual reality technologies," Interactive
[2] Hu, Fei; Hao, Qi; Sun, Qingquan; Cao, Xiaojun; Ma, Rui; Zhang, Ting; Patil, Yogendra; Lu, Jiang, "Cyberphysical System With Virtual Reality for Intelligent Motion Recognition and Training," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 2, pp. 347-363, 2017.
[3] http://phenix.it-sudparis.eu/jvet/
[4] https://www.vrs.org.uk/virtual-reality-gear/head-mounted-displays/
[5] Recommendation ITU-T P.913 (2014), "Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment." Available: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-P.913-201603-
[6] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
[7] L. Zhang, L. Zhang, X. Mou, and D. Zhang, "FSIM: A feature similarity index for image quality assessment," IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378-2386, Aug. 2011.
[8] Z. Wang and Q. Li, "Information content weighting for perceptual image quality assessment," IEEE Transactions on Image Processing, vol. 20, no. 5, pp. 1185-1198, May 2011.
[9] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, Nov. 2003, pp. 1398-1402.
[10] L. Zhang, L. Zhang, and X. Mou, "RFSIM: A feature based image quality assessment metric using Riesz transforms," in 2010 IEEE International Conference on Image Processing, Sep. 2010, pp. 321-324.
[11] Wang, Zhou & Bovik, Alan (2002). A Universal Image Quality Index.
[12] D. V. Nguyen, H. T. T. Tran, and T. C. Thang, "An evaluation of tile selection methods for viewport-adaptive streaming of 360-degree video," ACM Trans. Multimedia Comput. Commun. Appl., vol. 16, no. 1, Mar. 2020. [Online]. Available: https://doi.org/10.1145/3373359
[13] D. V. Nguyen, H. T. Tran, and T. C. Thang, "Impact of delays on 360-degree video communications," in 2017 TRON Symposium (TRONSHOW). IEEE, 2017, pp. 1-6.
[14] F. Qian, L. Ji, B. Han, and V. Gopalakrishnan, "Optimizing 360 video delivery over cellular networks," in Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications, and Challenges, ser. ATC '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 1-6. [Online]. Available: https://doi.org/10.1145/2980055.2980056
[15] Y. Bao, H. Wu, T. Zhang, A. A. Ramli, and X. Liu, "Shooting a moving target: Motion-prediction-based transmission for 360-degree videos," in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 1161-1170.
[16] C.-L. Fan, J. Lee, W.-C. Lo, C.-Y. Huang, K.-T. Chen, and C.-H. Hsu, "Fixation prediction for 360 video streaming in head-mounted virtual reality," in Proceedings of the 27th Workshop on Network and Operating Systems Support for Digital Audio and Video, 2017, pp. 67-72.
[17] Y. Ban, L. Xie, Z. Xu, X. Zhang, Z. Guo, and Y. Wang, "Cub360: Exploiting cross-users' behaviors for viewport prediction in 360 video adaptive streaming," in 2018 IEEE International Conference on Multimedia and Expo (ICME), 2018, pp. 1-6.
[18] C. Wu, R. Zhang, Z. Wang, and L. Sun, "A spherical convolution approach for learning long-term viewport prediction in 360 immersive videos," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 14003-14
[19] M. Hosseini and V. Swaminathan, "Adaptive 360 VR video streaming: Divide and conquer," in 2016 IEEE International Symposium on Multimedia (ISM), 2016, pp. 107-110.
[20] D. V. Nguyen, H. T. T. Tran, A. T. Pham, and T. C. Thang, "A new adaptation approach for viewport-adaptive 360-degree video streaming," in 2017 IEEE International Symposium on Multimedia (ISM), 2017, pp. 38-44.
[21] A. Zare, A. Aminlou, M. M. Hannuksela, and M. Gabbouj, "HEVC-compliant tile-based streaming of panoramic video for virtual reality applications," in Proceedings of the 24th ACM International Conference on Multimedia, ser. MM '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 601-605. [Online]. Available: https://doi.org/10.1145/2964284
[22] D. V. Nguyen, H. T. T. Tran, A. T. Pham, and T. C. Thang, "An optimal tile-based approach for viewport-adaptive 360-degree video streaming," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 29-42, 2019.
[23] S. Petrangeli, V. Swaminathan, M. Hosseini, and F. De Turck, "An HTTP/2-based adaptive streaming framework for 360° virtual reality videos," in Proceedings of the 25th ACM International Conference on Multimedia, ser. MM '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 306-314. [Online].
[24] Z. Xu, X. Zhang, K. Zhang, and Z. Guo, "Probabilistic viewport adaptive streaming for 360-degree videos," in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, pp. 1-5.
[25] C. Li, W. Zhang, Y. Liu, and Y. Wang, "Very long-term field of view prediction for 360-degree video streaming," in 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 2019, pp. 297-302.
[26] X. Feng, V. Swaminathan, and S. Wei, "Viewport prediction for live 360-degree mobile video streaming using user-content hybrid motion tracking," Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 3, no. 2, Jun. 2019. [Online].
[27] J. Heyse, M. T. Vega, F. de Backere, and F. de Turck, "Contextual bandit learning-based viewport prediction for 360 videos," in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 2019, pp. 972-973.
[28] L. Sun, Y. Mao, T. Zong, Y. Liu, and Y. Wang, "Flocking-based live streaming of 360-degree video," in Proceedings of the 11th ACM Multimedia Systems Conference, ser. MMSys '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 26-37. [Online]. Available: https://doi.org/10.1145/3339825.3391856
[29] A. T. Nasrabadi, A. Samiei, and R. Prakash, "Viewport prediction for 360° videos: A clustering approach," in Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, ser. NOSSDAV '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 34-39. [Online]. Available: https://doi.org/10.1145/3386290.3396934
[30] J. Park, M. Wu, K.-Y. Lee, B. Chen, K. Nahrstedt, M. Zink, and R. Sitaraman, "SEAWARE: Semantic aware view prediction system for 360-degree video streaming," in 2020 IEEE International Symposium on Multimedia (ISM), 2020, pp. 57-64.
[31] T. C. Thang, H. T. Le, A. T. Pham, and Y. M. Ro, "An evaluation of bitrate adaptation methods for HTTP live streaming," IEEE Journal on Selected Areas in Communications, vol. 32, no. 4, pp. 693-705, Apr. 2014.
[32] X. Corbillon, F. De Simone, and G. Simon, "360-degree video head movement dataset," in Proceedings of the 8th ACM on Multimedia Systems Conference, ser. MMSys '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 199-204. [Online]. Available: https://doi.org/10.1145/3083187.3083215
[33] D. Nguyen, "An evaluation of viewport estimation methods in 360-degree video streaming," in 2022 7th International Conference on Business and Industrial
[35] T. C. Thang, H. T. Le, H. X. Nguyen, A. T. Pham, J. W. Kang, and Y. M. Ro, "Adaptive video streaming over HTTP with dynamic resource estimation," Journal of Communications and Networks, vol. 15, no. 6, pp. 635-644, Dec. 2013, doi: 10.1109/JCN.2013.000112.
[36] P. Juluri, V. Tamarapalli, and D. Medhi, "SARA: Segment aware rate adaptation algorithm for dynamic adaptive streaming over HTTP," in 2015 IEEE International Conference on Communication Workshop (ICCW),
[37] Huyen T. T. Tran, Hung T. Le, Nam Pham Ngoc, Anh T. Pham, and Truong Cong Thang, "Quality improvement for video on-demand streaming over HTTP," IEICE Transactions on Information and Systems, vol. E100.D, no. 1, pp. 61-64, Jan. 2017, Online ISSN 1745-1361, Print ISSN 0916-8532, https://doi.org/10.1587/transinf.2016MUL0005
[38] W. Choi and J. Yoon, "SATE: Providing stable and agile adaptation in HTTP-based video streaming," IEEE Access, vol. 7, pp. 26830-26841, 2019, doi: 10.1109/ACCESS.2019.2901279.
[39] A. Bentaleb, B. Taani, A. C. Begen, C. Timmerer, and R. Zimmermann, "A survey on bitrate adaptation schemes for streaming media over HTTP," IEEE Communications Surveys & Tutorials, vol. 21, no. 1, pp. 562-585, First quarter
[40] Chenghao Liu, Imed Bouazizi, and Moncef Gabbouj, "Rate adaptation for adaptive HTTP streaming," in Proceedings of the Second Annual ACM Conference on Multimedia Systems (MMSys '11). Association for Computing Machinery, New York, NY, USA, 2011, pp. 169-174. DOI: https://doi.org/10.1145/1943552.1943575
[41] D. V. Nguyen, H. T. T. Tran, Pham Ngoc Nam, and T. C. Thang, "A QoS-adaptive framework for screen sharing over the Internet," in 2016 Eighth International Conference on Ubiquitous and Future Networks (ICUFN), Vienna, 2016, pp. 972-974, doi: 10.1109/ICUFN.2016.7536942.
[42] Z. Li et al., "Probe and Adapt: Rate adaptation for HTTP video streaming at scale," IEEE Journal on Selected Areas in Communications, vol. 32, no. 4, pp. 719-
[43] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson, "A buffer-based approach to rate adaptation: Evidence from a large video streaming service," SIGCOMM Comput. Commun. Rev., vol. 44, no. 4, pp. 187-198, Oct. 2014. DOI: https://doi.org/10.1145/2740070.2626296
[44] C. Mueller, S. Lederer, R. Grandl, and C. Timmerer, "Oscillation compensating dynamic adaptive streaming over HTTP," in 2015 IEEE International Conference on Multimedia and Expo (ICME), Turin, 2015, pp. 1-6, doi:
[45] K. Spiteri, R. Urgaonkar, and R. K. Sitaraman, "BOLA: Near-optimal bitrate adaptation for online videos," in IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, San Francisco, CA, 2016, pp. 1-9, doi: 10.1109/INFOCOM.2016.7524428.
[46] Abdelhak Bentaleb, Ali C. Begen, Saad Harous, and Roger Zimmermann, "Want to play DASH? A game theoretic approach for adaptive streaming over HTTP," in Proceedings of the 9th ACM Multimedia Systems Conference (MMSys '18). Association for Computing Machinery, New York, NY, USA, 2018, pp. 13-26. DOI: https://doi.org/10.1145/3204949.3204961
[47] K. Miller, E. Quacchio, G. Gennari, and A. Wolisz, "Adaptation algorithm for adaptive streaming over HTTP," in 2012 19th International Packet Video Workshop (PV), Munich, 2012, pp. 173-178, doi: 10.1109/PV.2012.6229732.
[48] Cong Wang, Amr Rizk, and Michael Zink, "SQUAD: A spectrum-based quality adaptation for dynamic adaptive streaming over HTTP," in Proceedings of the 7th International Conference on Multimedia Systems (MMSys '16). Association for Computing Machinery, New York, NY, USA, 2016, Article 1, pp. 1-12. DOI: https://doi.org/10.1145/2910017.2910593
[49] A. Beben, P. Wisniewski, J. Mongay Batalla, and P. Krawiec, "ABMA+: A lightweight and efficient algorithm for HTTP adaptive streaming," in Proceedings of the 7th International Conference on Multimedia Systems (MMSys '16). Association for Computing Machinery, New York, NY, USA, 2016, Article 2, pp. 1-11. DOI: https://doi.org/10.1145/2910017.2910596
[50] H. T. T. Tran, N. P. Ngoc, A. T. Pham, and T. C. Thang, "A multi-factor QoE model for adaptive streaming over mobile networks," in 2016 IEEE Globecom Workshops (GC Wkshps), 2016, pp. 1-6, doi: 10.1109/GLOCOMW.2016.7848818.
[51] Big Buck Bunny, Xiph.org Test Media, https://media.xiph.org/
[52] Huyen T. T. Tran, Nam Pham Ngoc, Yong Ju Jung, Anh T. Pham, and Truong Cong Thang, "A histogram-based quality model for HTTP adaptive streaming," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E100.A, no. 2, pp. 555-564, Feb. 2017, Online ISSN 1745-1337, Print ISSN 0916-8508, https://doi.org/10.1587/transfun.E100.A.555
[53] C. Müller, S. Lederer, and C. Timmerer, "An evaluation of dynamic adaptive streaming over HTTP in vehicular environments," in Proceedings of the 4th Workshop on Mobile Video (MoVid '12). Association for Computing Machinery, New York, NY, USA, 2012, pp. 37-42. DOI: https://doi.org/10.1145/2151677.2151686
[54] C. T. Dac et al., "QoE-aware video adaptive streaming over HTTP," in 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE),
[55] S. Kumar, R. Devaraj, A. Sarkar, and A. Sur, "Client-side QoE management for SVC video streaming: An FSM supported design approach," IEEE Transactions on Network and Service Management, vol. 16, no.
[56] D. J. Vergados, A. Michalas, A. Sgora, D. D. Vergados, and P. Chatzimisios, "FDASH: A fuzzy-based MPEG/DASH adaptation algorithm," IEEE Systems Journal, vol. 10, no. 2, pp. 859-868, June 2016, doi:
[57] Rahman, W. U.; Hossain, M. D.; Huh, E.-N., "Fuzzy-based quality adaptation algorithm for improving QoE from MPEG-DASH video," Appl. Sci.
[58] W. U. Rahman and K. Chung, "Buffer-based adaptive bitrate algorithm for streaming over HTTP," KSII Transactions on Internet and Information Systems, vol.
[59] Sobhani, A.; Yassine, A.; Shirmohammadi, S., "A fuzzy-based rate adaptation controller for DASH," in Proceedings of the 25th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video,
[60] J. Jiang, V. Sekar, and H. Zhang, "Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE," IEEE/ACM
[61] C. Zhou, C. Lin, X. Zhang, and Z. Guo, "TFDASH: A fairness, stability, and efficiency-aware rate control approach for multiple clients over DASH," IEEE Transactions on Circuits and Systems for Video Technology,
[62] Y. Xiaoqi, J. Abhishek, S. Vyas, and S. Bruno, "A control-theoretic approach for dynamic adaptive video streaming over HTTP," in Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM '15, New York, NY, USA, 2015, pp. 325-338.
[63] A. H. Zahran, D. Raca, and C. J. Sreenan, "ARBITER+: Adaptive rate-based intelligent HTTP streaming algorithm for mobile networks," IEEE Transactions on Mobile Computing, vol. 17, no. 12, pp. 2716-2728,
[64] M. Hongzi, N. Ravi, and A. Mohammad, "Neural adaptive video streaming with Pensieve," in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM '17, New York, NY, USA: ACM, 2017, pp. 197-210.
[65] Xuekai Wei, Mingliang Zhou, Sam Kwong, Hui Yuan, Shiqi Wang, Guopu Zhu, Jingchao Cao, "Reinforcement learning-based QoE-oriented dynamic adaptive streaming framework," Information