Problems with Kaltura
We have announced several times that the service should now be fine, because that was what we had been told - but unfortunately, it has not turned out that way.
The short message now is that work is going on day and night to get the service back in order, and that we hope it will be finished on Monday morning (March 15). Given the history and the nature of the problems, however, that is not something we dare to promise.
We post the current status of the service on Serviceinfo.dk from day to day, and from hour to hour when there is news.
NORDUnet operates a central Kaltura service for the Nordic research networks, which DeiC has offered since 2018. During the Covid crisis, usage has increased roughly tenfold, making it the world's largest 'on-premise installation' of Kaltura.
One of the strengths of Kaltura is that a central database keeps track of all the materials handled by the service. That database was growing so large that it created performance problems of its own, and on the advice of Kaltura it was split into two halves in the autumn of 2020, one with the materials and one with statistics.
The split worked as it should, except that the institutions' administrators could no longer see how many times an individual video had been viewed. To rectify this, Kaltura support (without informing NORDUnet) launched a script on Friday 5 March that was meant to import the statistics figures back into the Kaltura Management Console. However, this script had the side effect of bringing the index database out of sync.
The problems this had created inside Kaltura only became visible when use of the service picked up again on Monday morning, and they initially showed up mostly as performance problems. Following advice from Kaltura, the import script was stopped.
Kaltura's advice on how to repair the index database was then to take one index server out of rotation and copy the database from it to the others. Every operation of this kind takes many hours because of the large amount of data, so it took several hours to find out that this did not solve the problem either.
On Tuesday, Kaltura instead started a script that was to correct the database. Again, such an operation takes a long time, which is why other options had been tried first. However, the sync script failed at some point, and after joint debugging by NORDUnet and Kaltura, a new version of the script could be started later in the day.
On Wednesday, Kaltura advised spreading the load further by deploying three more Kaltura Media Space servers to improve the user experience. NORDUnet put them into production the same day.
On Thursday, Kaltura also advised deploying an additional indexing server. That was also done the same day.
Kaltura now has developer teams working in two shifts on a solution to the problems, and there is not much else to do about that than wait.
In the meantime, it has been made very clear to Kaltura that scripts and other support operations on the installation must not be performed without first being evaluated and approved by NORDUnet - no matter how trivial or unproblematic they may seem to Kaltura to be.
All parties are now working to ensure that matters are in order when we return on Monday morning. But we are dealing with developer teams who have to find and fix software errors while virtually any operation they want to test on the installation takes a significant amount of time because the amounts of data are so large. It is therefore not possible to promise results by a specific time.
Status is continuously announced via serviceinfo.dk where Kaltura has its own channel.
This whole process, and the way the service has run over the past week, certainly does not live up to the general standards we normally have for the operation of this type of service.
But we are not in a normal situation. In many ways we are breaking new ground here. Distance learning has existed for many years, but using video as massively as is happening now, both via Zoom and via Kaltura / Panopto / MediaSite, and on this scale, has not been seen before - not even in the rest of the world.
Virtually all software and operations providers in this area have had issues from time to time during the Covid era. We have seen this both with the services we ourselves offer in DeiC and with services that are bought commercially and hosted in international cloud environments.
These service platforms are not something you just install, operate and receive software updates for. As soon as large-scale scaling is involved, sustained and close collaboration between software developers and operations staff is required. Much of the software that runs these services was developed for other usage scenarios, and not all of it scales equally elegantly - no matter who the vendor is and no matter who operates it.
One might then think that we could just run these services separately in many small installations. Apart from the fact that it would be significantly more expensive than it is today, it is often not an option. Providers no longer support individual installations, but usually only offer the services from their own cloud, where they experience the same kind of scaling problems that are the reason for this article.
We must certainly do our utmost to ensure stable operation, but just as it takes time to build new lanes on a congested highway, we unfortunately also have to put up with "road work" now and then while software and infrastructure are scaled up by as much as a factor of 200, as has happened with some services during the pandemic.
Looking at the Covid era as a whole, no provider of video services and operations can claim to have been completely free of problems. The same applies to telecom operators.
We have simply been fortunate not to have had any serious problems with the Research Network or Zoom yet, and the comparison with them will constantly be made - but that comparison is probably not fair. The Research Network and Zoom are not the norm; they are the exception in this world.
Head of Forskningsnettet