504 timeout errors - how many suffering?

sean.L · June 7, 2021, 8:11am

There are many staff members, I agree!

Just to be safe, just include all of us in the PM and you’ll get a response from at least one of us.

The_Professor · June 7, 2021, 10:11am

Sean, thanks for acknowledging the issue. Yes, please let us know how the fault and the fix progress. Right now we may have false hope: if the cause is load related and a whole lot of users have given up using the site, you’re in a false calm of reduced load.

I PMed the 5 staff who showed recent activity, not all 12, and only 1 of them - you, Sean - read it and replied, the next day.

Chiquinho · June 7, 2021, 10:24am

Seems we did our job well now!

The_Professor · June 7, 2021, 10:37am

Indeed, and how many new users do we tell to email product support rather than vent in public, since community members can’t fix what they don’t know about?

I started the thread to gather more data, as AWS caches pages locally, so you can get a geographically based issue; from the geographic distribution of the replies I could determine that it is global.

Chiquinho · June 7, 2021, 10:40am

Maybe “thumb admins” was a little bit rude (sorry).
But it helped in the end!

The_Professor · June 7, 2021, 10:52am

Well no, we are only at the stage of problem acknowledged, not resolved. Right now we should back off and wait for a problem-resolved message.

I think we’re in a quiet low load period right now and the timeout will return.

We don’t know who the admin is; we only have a list of staff to scattergun via PM, one of whom read it a day later.

The 504 error is an issue between the reverse proxy, which acts as a cache, and the server. Fixing it requires either an admin of the platform (Discourse) to fix logical or DB errors, or an admin of the AWS account to fix VM autoscaling and VM migration. Neither of those admins is one of the crew, who consume the platform like we do, just with elevated permissions.

I could guess who the admin is: Jade was an early user, but is not active now.
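
As a rough illustration of how the gateway-class errors seen in this thread (502, 503, 504) differ, here is a minimal Python sketch. The status names come from the standard library's `http.HTTPStatus`; the `GATEWAY_HINTS` table and the `describe` helper are hypothetical names made up for this example, and the hint wording is a paraphrase, not output from any real proxy.

```python
from http import HTTPStatus

# Hypothetical lookup table: which side of the reverse proxy each
# gateway-class error usually points at. Wording is illustrative only.
GATEWAY_HINTS = {
    HTTPStatus.BAD_GATEWAY: "proxy got an invalid response from the backend",
    HTTPStatus.SERVICE_UNAVAILABLE: "server is overloaded or down for maintenance",
    HTTPStatus.GATEWAY_TIMEOUT: "proxy gave up waiting for the backend",
}


def describe(status_code: int) -> str:
    """Return a one-line description of an HTTP status code."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        # The code is not one the standard library knows about.
        return f"{status_code}: not a registered HTTP status"
    hint = GATEWAY_HINTS.get(status, "not a gateway-class error")
    return f"{status.value} {status.phrase}: {hint}"


if __name__ == "__main__":
    for code in (502, 503, 504):
        print(describe(code))
```

In all three cases the browser and the user's own network are usually not at fault, which is why wiping the cache or rebooting the router tends not to help.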

Chiquinho · June 7, 2021, 10:55am

I don’t know much about these server problems.
That was not my job at all in the past.
I had to care about students’ problems and had to help them out
(tests etc.).
It was a lot of psychology, not informatics.

But regarding those server problems:
of course, I always knew someone who knew it better than I did.

The_Professor · June 7, 2021, 11:02am

Right, well, you know about the speed of light?

A write necessarily has to be single threaded, as only one thread can write; if two threads write, they fight each other and corrupt the data.

As a site like this community has more lurkers and inactive users than active posters, it is common to have multiple caching servers, each holding a read-only copy of the site. When you post, the write thread does the update, then the read threads cache the update.

You scale up as many read threads as required to keep reading responsive, often via an autoscale parameter. You can’t scale writing; the only thing you can do is give it a faster CPU and faster storage.

Hence the quick lesson in web based applications.
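
The single-writer, many-readers layout described above can be sketched as a toy Python model. This is only an illustration of the simplified picture in this post: `Primary`, `ReadReplica`, and `demo` are hypothetical names, the "database" is a plain list, and real databases handle concurrent writes with much finer-grained locking than one global lock.

```python
import threading


class Primary:
    """The single writable copy; one lock serialises every write."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._posts: list[str] = []

    def write(self, post: str) -> None:
        with self._lock:  # only one writer can be inside at a time
            self._posts.append(post)

    def snapshot(self) -> list[str]:
        with self._lock:
            return list(self._posts)


class ReadReplica:
    """A read-only cached copy; add more of these to scale reading."""

    def __init__(self) -> None:
        self._posts: list[str] = []

    def sync(self, primary: Primary) -> None:
        # Refresh the cache from the primary after it has been updated.
        self._posts = primary.snapshot()

    def read(self) -> list[str]:
        return list(self._posts)


def demo(n_writers: int = 8, n_replicas: int = 3) -> list[list[str]]:
    """Write from several threads, then read the result from each replica."""
    primary = Primary()
    threads = [
        threading.Thread(target=primary.write, args=(f"post {i}",))
        for i in range(n_writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    replicas = [ReadReplica() for _ in range(n_replicas)]
    for replica in replicas:
        replica.sync(primary)
    return [replica.read() for replica in replicas]
```

Adding replicas makes reading cheaper, but every write still queues on the one lock, which is the bottleneck this post is pointing at.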

Now then, they have a vulnerability. They have 1.3M users. If they all tried to post at the same time, the community would die for most users. They also obviously don’t have staff working weekends. Hence it’s only through luck that it hasn’t yet crashed for a whole weekend.

Chiquinho · June 7, 2021, 11:06am

1.3 million?
At ANKER’s community, with similar issues, there are about 10 of us active.
Here there might be some more.

The_Professor · June 7, 2021, 11:14am

Well that’s exactly the point. The existence of many user accounts should not be a problem as they are mostly inactive.

The fact that a new user was being added every minute when I looked can be an issue, as that’s a write event.

It can’t be a new-posts issue, as it’s not particularly busy.

We just have to back off now and wait. They know, and saying it again won’t help now.

Shenoy · June 7, 2021, 3:12pm

Not sure if Anker and Soundcore are interested in getting more users, with this kind of sluggish community access and errors.

How many more threads and posts are needed to get Admin & Moderator’s attention??

@Loz @Sean.l

Shenoy · June 7, 2021, 3:13pm

… got an error as I posted this

The_Professor · June 7, 2021, 3:22pm

Shenoy, they know about it; we just have to wait for the admin they informed today to deliver a fix.

The purpose of the thread was to find out whether the issue is specific to certain users or regions. We seem to have a global issue, they now know, and I suggest we give them breathing space to address it.

The staff you see here aren’t the admin, and they can only tell the admin once…

I’d give the admin overnight, so until tomorrow, to do a fix. It seems to be a single-threaded write DB issue, so I’d expect a short outage to migrate to faster storage or a larger-memory VM.

So if there is an outage it’s for good reasons.

I’m seeing 502 errors now.

Shenoy · June 7, 2021, 3:34pm

Thanks @The_Professor

Hope this resolves soon!

sean.L · June 7, 2021, 4:10pm

In the future, if you have any technical issues with the website, please PM me, @hannah, or @william.ward.

For your reference, this issue was reported and is currently being looked into, and the community will be updated as soon as I have an update from the IT team.

Shenoy · June 7, 2021, 4:13pm

Thanks @sean.L

The_Professor · June 7, 2021, 5:34pm

The response from the Soundcore Collective crew has been excellent: they acknowledged the issue in a post and replied to my PM.

When I reported it in the Anker community 13 days ago, there was no indication it was read and no reply. It’s an identical problem, so it likely has a common cause and hopefully a common fix.

PM is always the way to go, particularly at the weekend, as it triggers an email, which has more chance of being noticed than venting in a community space you only assume is being read. What Hannah asks of us I’d already done.

The outage we may need for a 504-type error would impact the ability to edit, so no posting or revising, but viewing and searching should still work. Putting the site into read-only mode, shutting down the DB, migrating it to faster hardware, bringing it back up, and returning to read-write typically takes hours. The outage would be shorter if it were a full one, as you’re not tinkering with a running system.

I’d not be surprised if a good admin had already asked for an outage and been told no, as there’s the LA2P launch, then the Q35 launch, then the … launch.

The Soundcore community was adding a user a minute when I looked, and the fact that both communities are slow at the same time implies they share the same servers. That makes sense: they added the Soundcore community to the Soundcore app, which increases load, and that bleeds across to Anker via the shared servers.

Fixing that would involve separating the servers for the two communities, which takes more than hours and so would mean a weekend outage for both.

That’s the sort of discussion I am imagining now going on in the background.

So our role is to report it (done), give time for a fix (now) and accept outages as required (future).

sean.L · June 8, 2021, 11:14am

Hey everyone, I had a chat with the IT team this morning and they have looked into the problem.

The current update is:

According to the log data, service is currently normal, but the servers have been updated to improve and strengthen the service.

If anyone experiences any further problems over the next few days or over the weekend, please message me and I’ll stay on top of it. I have let them know it’s affecting the Anker community too.

Thanks for all the feedback and help!

The_Professor · June 8, 2021, 11:48am

Thanks for the service update.

It’s good that the servers were upgraded.

As discussed, yes, PM is the best way to communicate service issues, as complaining inside a thread is going to take longer to get noticed.

However, the initial thought many of us have is “is it just me?”. I’d log out and back in, wipe the cache, reboot the router, etc., eliminating myself first. Next I had to figure out whether it was local to my region.

Example of a regional failure (503)

So I did create the thread to see how widespread the issue was - to see how real it was.

So in future, if this is repeated, I’d likely do both again: a PM to get it noticed, and a thread to see if it’s just me or local to a region.
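
The "is it just me, my region, or everyone?" reasoning above can be sketched in a few lines of Python. This is only an illustration: `classify_scope` is a hypothetical helper, the `(user, region)` report format is an assumption, and the threshold of three regions for calling an issue global is an arbitrary choice, not something measured in this thread.

```python
def classify_scope(reports, min_regions_for_global=3):
    """Guess how widespread an error is from thread replies.

    reports: iterable of (user, region) tuples, one per person who
    said they saw the error (hypothetical format).
    """
    reports = list(reports)
    users = {user for user, _ in reports}
    regions = {region for _, region in reports}
    if len(users) <= 1:
        # Nobody else has confirmed it yet: check your own setup first.
        return "possibly just me"
    if len(regions) >= min_regions_for_global:
        return "global"
    return "regional"


if __name__ == "__main__":
    replies = [("user_a", "EU"), ("user_b", "US"), ("user_c", "Asia")]
    print(classify_scope(replies))
```

A thread gathers exactly this kind of data, which a private PM to staff cannot, so the two serve different purposes.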

Hopefully we don’t have issues again (soon).

Edit: this message saved fast, first good sign.

The_Professor · June 22, 2021, 1:07pm

I think I’ve managed to PM the crew after many attempts, so I’m letting you know that they probably now know.

Similar errors to before, but now 502.