
programmers come

posted in Off Topic
Comments:
#1
yukky

i wanted to edit my webscraper to basically get a user's most common saying, but obviously a person can say the same thing in multiple ways

how would i go about evaluating this? i was thinking of using either levenshtein or hamming distance since both have libraries supporting them

i obviously dont wanna slow down the scraping process too much when evaluating this, so how should i go about it?
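
for context, the kind of distance check i mean would be something like this - just a sketch using rapidfuzz as an example (any levenshtein library would do):

from rapidfuzz.distance import Levenshtein

a = "some saying"
b = "s0me saying"

#number of single-character edits needed to turn one string into the other
print(Levenshtein.distance(a, b))               #1
#same idea scaled to 0..1, easier to use as a "close enough" cutoff
print(Levenshtein.normalized_similarity(a, b))  #~0.91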

#2
marlow
0
Frags
+

tbh i think that you'll be able to find an open-source program to do this for you, but if you wanted to do it yourself i think you could try to tokenise keywords - so it evaluates a specific order of tokens as the target phrase. idk how you'd actually code this though
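
a very rough guess at what the "specific order of tokens" part could look like (the tokeniser here is deliberately naive, it just lowercases and splits on spaces):

def tokens(text):
    #very naive tokeniser: lowercase and split on whitespace
    return text.lower().split()

def contains_phrase(comment, phrase):
    c, p = tokens(comment), tokens(phrase)
    #slide over the comment's tokens looking for the phrase's tokens in order
    return any(c[i:i + len(p)] == p for i in range(len(c) - len(p) + 1))

print(contains_phrase("honestly yay is good at this", "yay is good"))  #True
print(contains_phrase("good is yay tbh", "yay is good"))               #False, different order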

#5
yukky
0
Frags
+

I have an idea of how I could do it, I just need to find the most efficient way cause I don't wanna brick my database or make the program 3x slower

#6
marlow
0
Frags
+

yeah im working on a webscraper for VLR atm but for stats only, messages sound horrific in comparison. Don't have a good enough understanding to evaluate efficiency. Although I think there are fuzzy matching packages for the sort of stuff you're looking into.

#7
yukky
0
Frags
+

What are you scraping?

#11
marlow
0
Frags
+

trying to make a VLR fantasy league site, so converting the stats page - with relevant parameters - into csv file/mysql database. pretty low level but ive got it working alright now. using it for a stats page on my site, and to be able to have the data for when players want to have their own teams. makes it a lot more efficient than sending GET requests for everything client side.

import csv
import requests
from bs4 import BeautifulSoup

def scrapeStats(url):
    getResponse = requests.get(url)
    #verifying successful request response
    if getResponse.status_code == 200:
        #parse HTML content
        doc = BeautifulSoup(getResponse.content, "html.parser")
        #saving only table data
        table = doc.find(class_='wf-card mod-table mod-dark')
        headers = []
        rows = []
        #extract and fill the header and row arrays with table data
        for th in table.find_all('th'):
            headers.append(th.text.strip())
        for tr in table.find_all('tr')[1:]:
            row = []
            for td in tr.find_all('td'):
                row.append(td.text.strip())
            rows.append(row)
        return headers, rows
    else:
        #in case of unsuccessful request response
        print("Request Error")
        return None, None

def clean_data(headers, rows):
    #appending new team column
    for row in rows:
        player_name = row[0]
        team_name = player_name.split()[-1]
        row.append(team_name)
    #deleting redundant agents column
    agents_index = headers.index('Agents')
    for row in rows:
        del row[agents_index]
    del headers[agents_index]

def save_to_csv(headers, rows, fileName):
    with open(fileName, 'w', newline='') as csvFile:
        writer = csv.writer(csvFile)
        #write headers and rows
        #create team header
        writer.writerow(headers + ['Team'])
        for row in rows:
            writer.writerow(row)
    print(f"Data has been saved to {fileName}")

def main():
    #change this depending on the vlr stats page that you want it to scrape from
    url = "https://www.vlr.gg/event/stats/1921/champions-tour-2024-masters-madrid?exclude=&min_rounds=0&agent=all"
    #created CSV file name
    fileName = 'vlr_stats.csv'
    #scraping data
    headers, rows = scrapeStats(url)
    #check data is retrieved successfully
    if headers and rows:
        #clean data
        clean_data(headers, rows)
        #save data to CSV file
        save_to_csv(headers, rows, fileName)

if __name__ == "__main__":
    main()

#13
yukky
0
Frags
+

pretty cool pretty cool

tbh for the urls you could just have a txt file of all the urls you wanna scrape and have the code read the urls in from that so u dont have to change up the program for each new url if that makes sense

so u like have like
url1
url2
url3
in one file
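
something like this maybe, reusing your scrapeStats function (urls.txt is just an example name):

def load_urls(fileName='urls.txt'):
    #one url per line, blank lines skipped
    with open(fileName) as f:
        return [line.strip() for line in f if line.strip()]

for url in load_urls():
    headers, rows = scrapeStats(url)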

#15
marlow
0
Frags
+

yeah that's a good point, thanks. I'll set the whole file up as a function to call for different parts of the site I think. It's very bare bones atm and it only works for individual leagues during splits (it gets really complicated when teams are playing different numbers of matches).

#16
marlow
0
Frags
+

should have it set up for users by september, depends how often i get to work on it

i unfortunately got beaten to the idea by another guy (cant remember his sites name), but im mainly just doing this for a personal project

#3
Nachtel
-1
Frags
+

wdym in multiple ways? if you mean the saying is written in different kinds of posts then you just grab the entire reply/comment.

If you're using python, look for the saying within the big block of text with something like:

instances = []
for text_block in user_history:
    if string1 in text_block:  #returns True if there is any instance of the string you provided within the larger string (the block of text)
        instances.append(text_block)

#4
yukky
0
Frags
+

No I meant for example the following strings:

"yay is good"
"y0y is good"
"good is y0y"
"good is yay"

Are practically the same to the human eye, but comparison-wise they are different. That's why I want to use levenshtein or hamming distance. Also the code you provided is kind of slow and I'm looking for a faster solution since I'll most likely be doing this for every user's post
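
for reference, even the standard library gets close to what i mean - a rough sketch using difflib (the 0.8 cutoff and the token sorting are just guesses, a proper levenshtein library would be faster):

from difflib import SequenceMatcher

def normalise(phrase):
    #sorting the tokens makes "good is yay" and "yay is good" identical
    return " ".join(sorted(phrase.lower().split()))

def similar(a, b, threshold=0.8):
    #ratio() is roughly how much of the two strings lines up (1.0 = identical)
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold

print(similar("yay is good", "y0y is good"))    #True
print(similar("yay is good", "good is yay"))    #True
print(similar("yay is good", "chamber is bad")) #False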

#17
Nachtel
-1
Frags
+

Haha yeah I apologize, there are definitely more optimized ways to do it.

Would speed still be an issue if you ran the program concurrently using multiprocessing, with an instance for every user profile?

#definitely a way to automate and optimize this part but no time rn
look_for = ["yay is good", "y0y is good", "good is y0y", "good is yay", "good yay is", "good y0y is", "y0y good is", "yay good is"]

for comment in message_history:
    for instance in look_for:
        if instance in comment:
            Do_thing()

It's an ugly solution but would this work for what you're trying to do?
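
roughly what the multiprocessing part could look like (all_histories and the phrases here are placeholder data, not anything from your scraper):

from multiprocessing import Pool

#placeholder target phrases; in practice these would come from the matcher
look_for = ["yay is good", "y0y is good", "good is yay"]

def scan_user(user_history):
    #count comments in one user's history that contain any target phrase
    return sum(1 for comment in user_history
               if any(phrase in comment for phrase in look_for))

if __name__ == "__main__":
    #each inner list is one user's scraped comments (placeholder data)
    all_histories = [["yay is good ngl", "who asked"], ["good is yay", "y0y diff"]]
    with Pool() as pool:
        hits = pool.map(scan_user, all_histories)
    print(hits)  #[1, 1]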

#18
marlow
1
Frags
+

this would genuinely take a bajillion years

#19
yukky
0
Frags
+

Somewhat, it's a starting point, but I think I'm still going to go with hamming distance or something because it needs fewer comparisons.

I'm already doing the webscraping concurrently at around 10-14 pages per second depending on my connection.

This honestly might get complicated too fast, might have to drop the idea ngl lol

#20
marlow
0
Frags
+

i think hamming distance relies on the object and target being the same size/length though, so might not work well

#21
Nachtel
-1
Frags
+

Depending on how complicated the phrase he's looking for is, the size of the object/target shouldn't be an issue if he applies padding

As in, take splices out of the object with the same length as the target or pad out the target if it doesn't meet the length of the splice
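
a rough sketch of that splice/padding idea in plain python (padding with spaces and sliding one character at a time are just assumptions):

def hamming(a, b):
    #assumes equal length: count positions where the characters differ
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))

def best_window_distance(comment, target):
    #pad the comment so there is always at least one window the size of the target
    padded = comment.ljust(len(target))
    #slide a target-sized window over the comment and keep the closest match
    return min(hamming(padded[i:i + len(target)], target)
               for i in range(len(padded) - len(target) + 1))

print(best_window_distance("i think y0y is good at this", "yay is good"))  #1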

#22
marlow
0
Frags
+

honestly i dont know enough about it in practice to know if it'd work well, but that makes sense

#8
Psion
0
Frags
+

so basically what i did is hire cheap foreign child labor from china
(for legal reasons this is a joke)

#9
yukky
-1
Frags
+

i know a few people who have gotten their minimum viable product by just outsourcing the setup to india, and just taking the code

(obviously after paying)

#14
Psion
1
Frags
+

LOL

#10
wiki
1
Frags
+

one day, we may hope for a VLR public API :(

#12
yukky
0
Frags
+

the mods barely update this website

i dont think they're gonna make a public api...

even so any valorant related api is usually locked and keys are given to a select few

#23
wiki
0
Frags
+

it's a lil annoying that this is the case more globally. i get that they don't want to have to deal with GDPR etc, but non-personal data such as match statistics etc should be accessible even from a Riot API

#24
Nachtel
-1
Frags
+

Match statistics used to be publicly accessible via the riot API, then people abused it to leak pro teams' scrim results

So they restricted access to companies that pass a vetting process like tracker.gg

#25
wiki
1
Frags
+

then just don't have a public endpoint for scrim results lol, seems weird to have an all or nothing model
