开发者

reading file large file very slow, please help

开发者 https://www.devze.com 2023-03-31 06:11 出处:网络
this code takes about 30 mins and high cpu usage, what is the problem Do strLine = objReader.ReadLine()

this code takes about 30 mins and high cpu usage, what is the problem

Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If
  ' create proxy
  With tmpProxy
    .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    .Status = "new"
  End With
  ' check 
  If lstProxys.Contains(tmpProxy) = True Then
    Continue Do
  End If
  lstProxys.Add(tmpProxy)
  Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

is the slowness from the comparism or from reading the file or from the regex?

EDIT

profiling the code like this

 Dim myTimer As New System.Diagnostics.Stopwatch()
        Dim t1 As Integer = 0
        Dim t2 As Integer = 0
        Dim t3 As Integer = 0
        'read the file line by line, collecting valid proxy
        Do
            'Read a line fromn the file
            myTimer.Reset()
            myTimer.Start()
            strLine = objReader.ReadLine()
            If strLine Is Nothing Then
                Exit Do
            End If
            myTimer.Stop()
            t1 = myTimer.Elapsed.Milliseconds
            'check valid proxy
            myTimer.Reset()
            myTimer.Start()
            m = Regex.Match(strLine.Trim, strProxyParttern)
            strMatch = m.Value.Trim
            If String.IsNullOrEmpty(strMatch) = True OrElse _
                strMatch.Contains("..") = True Then
                Continue Do
            End If
            myTimer.Stop()
            t2 = myTimer.Elapsed.Milliseconds
            ' create proxy
            myTimer.Reset()
            myTimer.Start()
            tmpProxy.IP = strMatch.Substring(0, strMatch.IndexOf(":"))
            tmpProxy.Port = CInt(str开发者_StackOverflow中文版Match.Substring(strMatch.IndexOf(":") + 1))
            tmpProxy.Status = "new"

            ' check 
            If lstProxys.Contains(tmpProxy) = True Then
                Continue Do
            End If
            lstProxys.Add(tmpProxy)
            myTimer.Stop()
            t2 = myTimer.Elapsed.Milliseconds
            Debug.Print(String.Format("Read={0}, Match={1}, Add={2}", t1, t2, t3))
        Loop Until strLine Is Nothing

gave these results

Read=0, Match=0, Add=1
Read=0, Match=0, Add=1
Read=0, Match=0, Add=2
...
Read=0, Match=0, Add=9
Read=0, Match=0, Add=9
Read=0, Match=0, Add=10
...
...
Read=0, Match=0, Add=39
Read=0, Match=0, Add=39
Read=0, Match=0, Add=40
etc

looks like the code is ok right, except for the add to the list


The speed issue is because you are using a List(Of Structure). The List.Contains method is a linear search (it goes through each item of the list to see if it matches) so it takes increasingly longer the more unique items you add to the list.

Because you're dealing with a large number of items, change lstProxys into a HashSet(Of T). You should see a huge performance boost. All you should need to do is change the definition of lstProxys:

Dim lstProxys as New HashSet(Of structure)


The disk I/O is usually the limiting factor for something like this. Depending on the disk speed you could expect a throughput of about 5-20 megabyte per second.

Regular expressions can be slow if they contain expressions that cause a lot of backtracking, so that is a possibility, but it should be pretty bad to be noticable compared to the disk I/O.

As there will never be more than one item in the proxy list, that comparion can't be the problem. You are not creating any new proxy object, but reusing the same, which means that you change the property of the object that you have already put in the list. As you are comparing the object with itself, the list will always contain the object after the first iteration, and will never be added a second time.

Does the proxy class do anything when you assign values to its properties? If it does something like creating a connection, that might be what's taking so long.


is the slowness from the comparism or from reading the file or from the regex?

We could take educated guesses but why not measure it instead.

For example run the following three tests separately under release mode and without the debugger attached and see how long it takes

'Test 1 Just IO

Do
  strLine = objReader.ReadLine()

Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

'Test 2 IO + Regex

Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If

Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If

'Test 3 IO + regex and Compare
Do
  strLine = objReader.ReadLine()
  If strLine Is Nothing Then
    Exit Do
  End If
  'check valid proxy
  m = Regex.Match(strLine.Trim, strProxyParttern)
  strMatch = m.Value.Trim
  If String.IsNullOrEmpty(strMatch) = True OrElse _
    strMatch.Contains("..") = True Then
    Continue Do
  End If
  ' create proxy
  With tmpProxy
    .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    .Status = "new"
  End With
  ' check 
  If lstProxys.Contains(tmpProxy) = True Then
    Continue Do
  End If
  lstProxys.Add(tmpProxy)
  Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
  Exit Sub
End If
0

精彩评论

暂无评论...
验证码 换一张
取 消