Live streaming and audio sharing platforms host large volumes of audio and video content that must be understood for quality inspection, tagging, and recommendation, a task that is impractical to perform manually. The real-time speech recognition feature of ASR transcribes audio and the audio streams in videos based on the audio/video transcription model. It meets the different latency requirements of different input sources and helps platform staff quickly understand large volumes of audio and video content, which significantly reduces labor costs and speeds up quality inspection, tagging, and recommendation.