Turkish NLP 2021 (Teknofest)


I initially wasn't planning on participating in Teknofest in 2021, but with my dad's encouragement I decided to join this competition, since I had worked with AI in some of my projects before. The project I made for the competition could be anything, as long as it had to do with natural language processing, specifically for Turkish.

The first thing I had to do was actually come up with an idea for a project, which was much easier said than done. After some brainstorming, I decided to try to make an AI that could reliably label social media comments as "toxic" or "not toxic". After deciding what I was going to do, I did some research on how exactly NLP programs were made and what tools I would need. I set up Hugging Face and PyTorch, and made a sentiment analyzer using a dataset of IMDb reviews. Once I had a functioning train/test pipeline, I started to look for ways I could make a dataset, which would prove to be the hardest part of the project by far.
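The train/evaluate loop I set up looks roughly like this. To keep it self-contained, the sketch below swaps the actual BERT fine-tune (Hugging Face + PyTorch on IMDb reviews) for a tiny keyword baseline on made-up data, so everything here is illustrative, not the real model:

```python
import random

# Stand-in dataset of (review, label) pairs; the real run used
# thousands of IMDb reviews. 1 = positive, 0 = negative.
data = [
    ("great movie loved it", 1), ("wonderful acting great plot", 1),
    ("terrible boring mess", 0), ("awful waste of time", 0),
    ("loved the wonderful ending", 1), ("boring and awful", 0),
]

def split(rows, test_frac=0.33, seed=0):
    """Shuffle and hold out a test set, as in any train/test setup."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

def train(rows):
    """'Training' here just records which words appeared in positive
    vs. negative examples -- a placeholder for real fine-tuning."""
    pos, neg = set(), set()
    for text, label in rows:
        (pos if label else neg).update(text.split())
    return pos, neg

def predict(model, text):
    pos, neg = model
    words = text.split()
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return 1 if score > 0 else 0

train_rows, test_rows = split(data)
model = train(train_rows)
# Accuracy on the held-out set -- the same metric used to judge the BERT model.
accuracy = sum(predict(model, t) == y for t, y in test_rows) / len(test_rows)
```

The point of the skeleton is the shape, not the classifier: swapping the keyword counter for a fine-tuned BERT leaves the split/train/evaluate structure unchanged.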

I was initially planning on manually getting my data from Twitter and saving it into two separate text files, but I immediately changed course once I discovered a Python module called Twint, a data scraper far more capable than Twitter's official API. I paired that with Pigeon to quickly label data within a Jupyter notebook. Unfortunately, data acquisition still wasn't over. My first attempt was to simply get tweets directed at popular political figures, since those were likely to contain a lot of toxicity, but many of those tweets were ambiguous as to whether they were toxic or not. After that I scraped by popular hashtags, but despite restricting the search to Turkey, I still got a significant number of tweets in other languages, so that was a no-go. Finally, I decided to simply make a list of insults and search for tweets containing them, and then also searched for tweets with some basic everyday words to balance the data out. This ended up being the best option, so I went with it.

I then took all the data, randomized the order, and started labelling. I am not exaggerating when I say this may have been the most tedious experience of my life. I went through over 5,000 tweets over the course of several days, and it was miserable. Not only did I have to do something extremely repetitive for such a long time, reading all those negative comments really took a toll on my sanity. It was funny how every once in a while, among all the toxicity, a single friendly English, Russian, or Japanese comment would show up. I labelled the tweets into three categories, "not toxic", "political", and "toxic", plus an "unusable" category for comments that were just emojis or links, or were in another language.
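The collection step above can be sketched in a few lines. The keyword lists here are placeholders (the real ones were Turkish insults and everyday words), and the scraping itself is only indicated in a comment, since Twint needs network access; what the sketch does show is the pooling, deduplication, and shuffling that happened before labelling:

```python
import random

# Placeholder keyword lists -- the actual lists were in Turkish.
insult_terms = ["insult1", "insult2", "insult3"]
neutral_terms = ["merhaba", "güzel", "teşekkür"]

def build_queries(insults, neutrals):
    """Combine both lists so the scrape returns a rough balance of
    likely-toxic and likely-neutral tweets."""
    return insults + neutrals

def prepare_for_labelling(tweets, seed=42):
    """Deduplicate the pooled tweets and shuffle them, so tweets from
    the same query are not labelled back to back."""
    unique = list(dict.fromkeys(tweets))  # dedupe, preserving first-seen order
    random.Random(seed).shuffle(unique)
    return unique

# Each query in build_queries(...) would be fed to the scraper
# (the post used Twint) and the results pooled. Here we fake a
# small pool that contains one duplicate.
scraped = ["tweet a", "tweet b", "tweet a", "tweet c"]
ready = prepare_for_labelling(scraped)
print(len(ready))  # 3 -- the duplicate is dropped before labelling
```

The shuffled list is what would then be handed to a labelling widget such as Pigeon, one tweet at a time.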

After I got through the labelling, I needed to make a small preprocessing function to remove links, hashtags, emojis, and the like. I also switched the BERT model I was training to the Turkish model. Once I was done with the changes, all I had left was to train again, this time with my own dataset, which took a few hours. However, the initial accuracy wasn't as good as I had hoped: around 90% when I was aiming for at least 95%. I decided to merge the "political" category into "not toxic" to make things less confusing for the model, and trained once again, this time getting over 95% accuracy.

When I was happy with it, I uploaded everything to GitHub and started preparing a presentation for the judges at the competition. I ended up doing pretty well as the only solo competitor, placing in the top 7. I feel like if I had made something less generic than a sentiment analyzer, I could have easily gotten at least top 3. This was my first time doing something completely AI-based, and overall I would consider it a success.
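The preprocessing and the category merge can be sketched as follows. The exact cleaning rules are a reconstruction, not the original code, but they cover the cases mentioned above, links, hashtags, mentions, and emoji:

```python
import re

def preprocess(text: str) -> str:
    """Strip links, @mentions, #hashtags, and emoji before feeding text
    to the tokenizer. These regexes are illustrative, not the originals."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # links
    text = re.sub(r"[@#]\w+", " ", text)                 # mentions and hashtags
    text = re.sub(r"[^\w\s.,!?'\"-]", " ", text)         # emoji and stray symbols
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

def merge_labels(label: str) -> str:
    """Fold 'political' into 'not toxic' to give the model a cleaner
    binary target; 'unusable' rows were dropped before this step."""
    return "not toxic" if label == "political" else label

print(preprocess("Check https://t.co/abc #etiket @kisi harika 😀"))
# -> "Check harika"
```

Note that `\w` in Python's `re` is Unicode-aware by default, so Turkish letters like ğ, ş, and ı survive the cleanup while emoji are stripped.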